oguzhanercan's Collections: Architectural Proposals
Byte Latent Transformer: Patches Scale Better Than Tokens · Paper 2412.09871 · 108 upvotes
Causal Diffusion Transformers for Generative Modeling · Paper 2412.12095 · 23 upvotes
Tensor Product Attention Is All You Need · Paper 2501.06425 · 90 upvotes
TransMLA: Multi-head Latent Attention Is All You Need · Paper 2502.07864 · 69 upvotes
Transformers without Normalization · Paper 2503.10622 · 172 upvotes
LSNet: See Large, Focus Small · Paper 2503.23135 · 11 upvotes
DDT: Decoupled Diffusion Transformer · Paper 2504.05741 · 77 upvotes
Latent Diffusion Autoencoders: Toward Efficient and Meaningful Unsupervised Representation Learning in Medical Imaging · Paper 2504.08635 · 4 upvotes
D^2iT: Dynamic Diffusion Transformer for Accurate Image Generation · Paper 2504.09454 · 11 upvotes
Efficient Generative Model Training via Embedded Representation Warmup · Paper 2504.10188 · 12 upvotes
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax · Paper 2504.20966 · 31 upvotes
Group Downsampling with Equivariant Anti-aliasing · Paper 2504.17258 · 9 upvotes
Paper 2505.14513 · 29 upvotes
LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer · Paper 2506.06952 · 9 upvotes
Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression · Paper 2506.09482 · 45 upvotes
From Bytes to Ideas: Language Modeling with Autoregressive U-Nets · Paper 2506.14761 · 17 upvotes
Energy-Based Transformers are Scalable Learners and Thinkers · Paper 2507.02092 · 69 upvotes
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling · Paper 2507.07955 · 27 upvotes
Region-based Cluster Discrimination for Visual Representation Learning · Paper 2507.20025 · 20 upvotes
PixNerd: Pixel Neural Field Diffusion · Paper 2507.23268 · 52 upvotes
Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer · Paper 2508.14187 · 4 upvotes
Artificial Hippocampus Networks for Efficient Long-Context Modeling · Paper 2510.07318 · 32 upvotes
Paper 2511.11238 · 39 upvotes
mHC: Manifold-Constrained Hyper-Connections · Paper 2512.24880 · 322 upvotes
Scaling Embeddings Outperforms Scaling Experts in Language Models · Paper 2601.21204 · 102 upvotes
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders · Paper 2601.16208 · 55 upvotes
Nested Learning: The Illusion of Deep Learning Architectures · Paper 2512.24695 · 45 upvotes
Locality-Attending Vision Transformer · Paper 2603.04892 · 7 upvotes
Paper 2603.08709 · 16 upvotes
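Of the entries above, "Transformers without Normalization" (Paper 2503.10622) proposes Dynamic Tanh (DyT), an element-wise drop-in replacement for LayerNorm: DyT(x) = gamma * tanh(alpha * x) + beta, with a learnable scalar alpha and per-channel gamma/beta. A minimal pure-Python sketch (no training, parameters fixed at their initial values; the class name and init value 0.5 follow the paper, everything else is illustrative):

```python
import math

class DyT:
    """Dynamic Tanh (DyT) layer from "Transformers without Normalization":
        DyT(x) = gamma * tanh(alpha * x) + beta
    Unlike LayerNorm, no mean/variance statistics are computed; the tanh
    squashes outliers while staying roughly linear near zero."""

    def __init__(self, dim, alpha_init=0.5):
        self.alpha = alpha_init       # learnable scalar in the paper
        self.gamma = [1.0] * dim      # per-channel scale, initialized to 1
        self.beta = [0.0] * dim       # per-channel shift, initialized to 0

    def __call__(self, x):
        # element-wise: scale, squash, then affine per channel
        return [g * math.tanh(self.alpha * xi) + b
                for xi, g, b in zip(x, self.gamma, self.beta)]

layer = DyT(dim=4)
out = layer([-10.0, -1.0, 0.0, 10.0])
# large-magnitude inputs saturate toward -1/+1; zero maps to zero
```

The point of the design is that normalization's benefit in Transformers comes largely from its squashing of extreme activations, which a fixed smooth nonlinearity with a learned input scale can reproduce without reduction ops.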