new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Apr 14

FreeText: Training-Free Text Rendering in Diffusion Transformers via Attention Localization and Spectral Glyph Injection

Large-scale text-to-image (T2I) diffusion models excel at open-domain synthesis but still struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Prior solutions typically require costly retraining or rigid external layout constraints, which can degrade aesthetics and limit flexibility. We propose FreeText, a training-free, plug-and-play framework that improves text rendering by exploiting intrinsic mechanisms of Diffusion Transformer (DiT) models. FreeText decomposes the problem into where to write and what to write. For where to write, we localize writing regions by reading token-wise spatial attribution from endogenous image-to-text attention, using sink-like tokens as stable spatial anchors and topology-aware refinement to produce high-confidence masks. For what to write, we introduce Spectral-Modulated Glyph Injection (SGMI), which injects a noise-aligned glyph prior with frequency-domain band-pass modulation to strengthen glyph structure and suppress semantic leakage (rendering the concept instead of the word). Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across longText-Benchmark, CVTG, and our CLT-Bench show consistent gains in text readability while largely preserving semantic alignment and aesthetic quality, with modest inference overhead.

  • 6 authors
·
Jan 1

MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation

Diffusion-based text-to-image (T2I) models have demonstrated remarkable results in global video editing tasks. However, their focus is primarily on global video modifications, and achieving desired attribute-specific changes remains a challenging task, specifically in multi-attribute editing (MAE) in video. Contemporary video editing approaches either require extensive fine-tuning or rely on additional networks (such as ControlNet) for modeling multi-object appearances, yet they remain in their infancy, offering only coarse-grained MAE solutions. In this paper, we present MAKIMA, a tuning-free MAE framework built upon pretrained T2I models for open-domain video editing. Our approach preserves video structure and appearance information by incorporating attention maps and features from the inversion process during denoising. To facilitate precise editing of multiple attributes, we introduce mask-guided attention modulation, enhancing correlations between spatially corresponding tokens and suppressing cross-attribute interference in both self-attention and cross-attention layers. To balance video frame generation quality and efficiency, we implement consistent feature propagation, which generates frame sequences by editing keyframes and propagating their features throughout the sequence. Extensive experiments demonstrate that MAKIMA outperforms existing baselines in open-domain multi-attribute video editing tasks, achieving superior results in both editing accuracy and temporal consistency while maintaining computational efficiency.

  • 11 authors
·
Dec 27, 2024

AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation

In the image acquisition process, various forms of degradation, including noise, haze, and rain, are frequently introduced. These degradations typically arise from the inherent limitations of cameras or unfavorable ambient conditions. To recover clean images from degraded versions, numerous specialized restoration methods have been developed, each targeting a specific type of degradation. Recently, all-in-one algorithms have garnered significant attention by addressing different types of degradations within a single model without requiring prior information of the input degradation type. However, these methods purely operate in the spatial domain and do not delve into the distinct frequency variations inherent to different degradation types. To address this gap, we propose an adaptive all-in-one image restoration network based on frequency mining and modulation. Our approach is motivated by the observation that different degradation types impact the image content on different frequency subbands, thereby requiring different treatments for each restoration task. Specifically, we first mine low- and high-frequency information from the input features, guided by the adaptively decoupled spectra of the degraded image. The extracted features are then modulated by a bidirectional operator to facilitate interactions between different frequency components. Finally, the modulated features are merged into the original input for a progressively guided restoration. With this approach, the model achieves adaptive reconstruction by accentuating the informative frequency subbands according to different input degradations. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on different image restoration tasks, including denoising, dehazing, deraining, motion deblurring, and low-light image enhancement. Our code is available at https://github.com/c-yn/AdaIR.

  • 6 authors
·
Mar 21, 2024 2

Few-shot Image Generation via Adaptation-Aware Kernel Modulation

Few-shot image generation (FSIG) aims to learn to generate new and diverse samples given an extremely limited number of samples from a domain, e.g., 10 training samples. Recent work has addressed the problem using transfer learning approach, leveraging a GAN pretrained on a large-scale source domain dataset and adapting that model to the target domain based on very limited target domain samples. Central to recent FSIG methods are knowledge preserving criteria, which aim to select a subset of source model's knowledge to be preserved into the adapted model. However, a major limitation of existing methods is that their knowledge preserving criteria consider only source domain/source task, and they fail to consider target domain/adaptation task in selecting source model's knowledge, casting doubt on their suitability for setups of different proximity between source and target domain. Our work makes two contributions. As our first contribution, we re-visit recent FSIG works and their experiments. Our important finding is that, under setups which assumption of close proximity between source and target domains is relaxed, existing state-of-the-art (SOTA) methods which consider only source domain/source task in knowledge preserving perform no better than a baseline fine-tuning method. To address the limitation of existing methods, as our second contribution, we propose Adaptation-Aware kernel Modulation (AdAM) to address general FSIG of different source-target domain proximity. Extensive experimental results show that the proposed method consistently achieves SOTA performance across source/target domains of different proximity, including challenging setups when source and target domains are more apart. Project Page: https://yunqing-me.github.io/AdAM/

  • 4 authors
·
Oct 29, 2022

Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection

Cross-domain few-shot object detection (CD-FSOD) aims to adapt pretrained detectors from a source domain to target domains with limited annotations, suffering from severe domain shifts and data scarcity problems. In this work, we find a previously overlooked phenomenon: models exhibit dispersed and unfocused attention in target domains, leading to imprecise localization and redundant predictions, just like a human cannot focus on visual objects. Therefore, we call it the target-domain Astigmatism problem. Analysis on attention distances across transformer layers reveals that regular fine-tuning inherently shows a trend to remedy this problem, but results are still far from satisfactory, which we aim to enhance in this paper. Biologically inspired by the human fovea-style visual system, we enhance the fine-tuning's inherent trend through a center-periphery attention refinement framework, which contains (1) a Positive Pattern Refinement module to reshape attention toward semantic objects using class-specific prototypes, simulating the visual center region; (2) a Negative Context Modulation module to enhance boundary discrimination by modeling background context, simulating the visual periphery region; and (3) a Textual Semantic Alignment module to strengthen center-periphery distinction through cross-modal cues. Our bio-inspired approach transforms astigmatic attention into focused patterns, substantially improving adaptation to target domains. Experiments on six challenging CD-FSOD benchmarks consistently demonstrate improved detection accuracy and establish new state-of-the-art results.

  • 4 authors
·
Mar 18

Fatigue-PINN: Physics-Informed Fatigue-Driven Motion Modulation and Synthesis

Fatigue modeling is essential for motion synthesis tasks to model human motions under fatigued conditions and biomechanical engineering applications, such as investigating the variations in movement patterns and posture due to fatigue, defining injury risk mitigation and prevention strategies, formulating fatigue minimization schemes and creating improved ergonomic designs. Nevertheless, employing data-driven methods for synthesizing the impact of fatigue on motion, receives little to no attention in the literature. In this work, we present Fatigue-PINN, a deep learning framework based on Physics-Informed Neural Networks, for modeling fatigued human movements, while providing joint-specific fatigue configurations for adaptation and mitigation of motion artifacts on a joint level, resulting in more realistic animations. To account for muscle fatigue, we simulate the fatigue-induced fluctuations in the maximum exerted joint torques by leveraging a PINN adaptation of the Three-Compartment Controller model to exploit physics-domain knowledge for improving accuracy. This model also introduces parametric motion alignment with respect to joint-specific fatigue, hence avoiding sharp frame transitions. Our results indicate that Fatigue-PINN accurately simulates the effects of externally perceived fatigue on open-type human movements being consistent with findings from real-world experimental fatigue studies. Since fatigue is incorporated in torque space, Fatigue-PINN provides an end-to-end encoder-decoder-like architecture, to ensure transforming joint angles to joint torques and vice-versa, thus, being compatible with motion synthesis frameworks operating on joint angles.

  • 2 authors
·
Feb 26, 2025

LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation

Looped Transformers have emerged as an efficient and powerful class of models for reasoning in the language domain. Recent studies show that these models achieve strong performance on algorithmic and reasoning tasks, suggesting that looped architectures possess an inductive bias toward latent reasoning. However, prior approaches fix the number of loop iterations during training and inference, leaving open the question of whether these models can flexibly adapt their computational depth under variable compute budgets. We introduce LoopFormer, a looped Transformer trained on variable-length trajectories to enable budget-conditioned reasoning. Our core contribution is a shortcut-consistency training scheme that aligns trajectories of different lengths, ensuring that shorter loops yield informative representations while longer loops continue to refine them. LoopFormer conditions each loop on the current time and step size, enabling representations to evolve consistently across trajectories of varying length rather than drifting or stagnating. Empirically, LoopFormer demonstrates robust performance on language modeling and reasoning benchmarks even under aggressive compute constraints, while scaling gracefully with additional budget. These results show that looped Transformers are inherently suited for adaptive language modeling, opening a path toward controllable and budget-aware large language models.

Hadamard product in deep learning: Introduction, Advances and Challenges

While convolution and self-attention mechanisms have dominated architectural design in deep learning, this survey examines a fundamental yet understudied primitive: the Hadamard product. Despite its widespread implementation across various applications, the Hadamard product has not been systematically analyzed as a core architectural primitive. We present the first comprehensive taxonomy of its applications in deep learning, identifying four principal domains: higher-order correlation, multimodal data fusion, dynamic representation modulation, and efficient pairwise operations. The Hadamard product's ability to model nonlinear interactions with linear computational complexity makes it particularly valuable for resource-constrained deployments and edge computing scenarios. We demonstrate its natural applicability in multimodal fusion tasks, such as visual question answering, and its effectiveness in representation masking for applications including image inpainting and pruning. This systematic review not only consolidates existing knowledge about the Hadamard product's role in deep learning architectures but also establishes a foundation for future architectural innovations. Our analysis reveals the Hadamard product as a versatile primitive that offers compelling trade-offs between computational efficiency and representational power, positioning it as a crucial component in the deep learning toolkit.

  • 5 authors
·
Apr 17, 2025

Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

Multimodal Large Language Models (MLLMs) are increasingly being deployed as automated content moderators. Within this landscape, we uncover a critical threat: Adversarial Smuggling Attacks. Unlike adversarial perturbations (for misclassification) and adversarial jailbreaks (for harmful output generation), adversarial smuggling exploits the Human-AI capability gap. It encodes harmful content into human-readable visual formats that remain AI-unreadable, thereby evading automated detection and enabling the dissemination of harmful content. We classify smuggling attacks into two pathways: (1) Perceptual Blindness, disrupting text recognition; and (2) Reasoning Blockade, inhibiting semantic understanding despite successful text recognition. To evaluate this threat, we constructed SmuggleBench, the first comprehensive benchmark comprising 1,700 adversarial smuggling attack instances. Evaluations on SmuggleBench reveal that both proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) state-of-the-art models are vulnerable to this threat, producing Attack Success Rates (ASR) exceeding 90%. By analyzing the vulnerability through the lenses of perception and reasoning, we identify three root causes: the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples. We conduct a preliminary exploration of mitigation strategies, investigating the potential of test-time scaling (via CoT) and adversarial training (via SFT) to mitigate this threat. Our code is publicly available at https://github.com/zhihengli-casia/smugglebench.

  • 11 authors
·
Apr 7