Inquiry about lyric flexibility and Text-Encoder (Embedding) compatibility in cover modes.

#3
by DucNguyenz - opened

Hello, I am using acestep.cpp for music covers and noticed that the current engine is optimized for the Qwen3-Embedding-0.6B model. I have a few questions regarding the cover modes and model scaling:

Lyric Flexibility: In cover mode, how strictly does the model adhere to the original melody when I provide a completely different set of lyrics? Is it designed to maintain the exact melodic structure while only swapping the vocal content?

Embedding Scale: If we were to use a larger Text-Encoder (such as the 4B or 8B versions), would it significantly improve the model's ability to interpret and align complex lyrics with the existing melody?

cover-nofsq vs. Standard cover: Does a "stronger" Embedding model provide a noticeable advantage in aligning lyrics to the original rhythm when using cover-nofsq (No Finite Scalar Quantization) compared to the standard cover mode?

Architecture Support: Currently, the engine returns an "[Server] Models: 1 LM, 0 Text-Enc, 1 DiT, 1 VAE, 0 LoRA" error when attempting to load larger GGUF embeddings (like the 8B version). Are there plans to support these larger architectures to improve overall text/lyric understanding and adherence?

Thank you for your amazing work!

Hey DucNguyenz, thanks for the kind words and for trying acestep.cpp!

Quick clarification first because it answers everything: there are TWO separate Qwen3 models in the pipeline and they do very different jobs.

  • Text-Enc (Qwen3-Embedding-0.6B): frozen encoder that feeds caption+lyrics vectors to the DiT. The DiT was trained end-to-end with this exact model, and its projection layers (1024 -> 2048) are baked into every DiT checkpoint. It is architecturally locked to 0.6B; this is an upstream ACE-Step constraint.

  • LM (acestep-5Hz-lm, 0.6B / 1.7B / 4B): causal language model that generates metadata, lyrics and audio codes from your caption. This is the Compose step. The LM and DiT were co-trained on the same music data: the LM learned to produce 5Hz audio codes that drive the DiT flow matching, and the DiT learned to generate audio from those codes while articulating the lyrics from the text conditioning. They speak the same musical language. The 4B version produces better lyrics and more coherent codes.

The key is the frequency split between the two models. The LM operates at 5Hz: each token represents 200ms of music, using a vocabulary of 64000 learned codes (think of it like a learned MIDI, except the features are not human-designed but discovered from the training data). At this low frequency, the LM builds the global musical structure autoregressively, token by token, with creative sampling (temperature, etc.) shaping the composition like writing a story sentence by sentence. The DiT then takes over at 25Hz (one frame every 40ms) and uses flow matching to fill in the high-frequency details: timbre, transients, articulation of the lyrics, stereo imaging. The LM gives the big picture, the DiT renders the fine grain.
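As a quick sanity check on the numbers above (toy arithmetic, not acestep.cpp code):

```python
# Toy arithmetic for the LM/DiT frequency split described above.
LM_HZ = 5    # LM tokens per second (one token = 200 ms of music)
DIT_HZ = 25  # DiT latent frames per second (one frame = 40 ms)

token_ms = 1000 / LM_HZ             # duration covered by one LM token
frame_ms = 1000 / DIT_HZ            # duration covered by one DiT frame
frames_per_token = DIT_HZ // LM_HZ  # DiT frames refined per LM token

# A 3-minute track in each representation:
seconds = 180
lm_tokens = seconds * LM_HZ    # coarse structure tokens
dit_frames = seconds * DIT_HZ  # fine-grained frames
```

So every LM token gets expanded into 5 DiT frames, which is exactly the "big picture vs fine grain" split.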

You can also skip the LM entirely and drive the DiT with real audio instead of LM-produced codes. That is what cover modes do: the source audio is VAE-encoded into 25Hz latents and fed directly to the DiT as context.
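In pseudocode, the two ways of conditioning the DiT look roughly like this. All names here are illustrative stubs, not the actual acestep.cpp API:

```python
# Illustrative sketch of the two conditioning paths (hypothetical names).
# String-returning stubs stand in for the real models.
def lm_generate(caption):
    return f"5Hz-codes({caption})"

def codes_to_latents(codes):
    return f"latents-from({codes})"

def vae_encode(audio):
    return f"25Hz-latents({audio})"

def dit_context(mode, caption=None, source_audio=None):
    """Where the DiT's context comes from in each mode."""
    if mode == "compose":
        # LM composes 5Hz audio codes that drive the DiT flow matching
        return codes_to_latents(lm_generate(caption))
    if mode in ("cover", "cover-nofsq"):
        # real audio drives the DiT directly, no LM involved
        return vae_encode(source_audio)
    raise ValueError(mode)
```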

Neither model has any impact on the DiT text conditioning, though; that is entirely handled by the frozen Text-Enc above.

Now your questions:

  • Lyrics in cover mode: it depends on which cover mode.

In standard cover, the source audio goes through a lossy FSQ roundtrip that degrades the latents. This conditioning is closer to what the DiT saw during training, so the text prompt and lyrics tend to take over: your new lyrics will strongly influence the result, and the melody from the source becomes more of a loose guide.

In cover-nofsq, the DiT receives clean VAE latents with no FSQ degradation, so the source structure is much more present. How much the lyrics vs the source dominate depends on audio_cover_strength (the fraction of DiT steps that hear the source context). Lower strength = lyrics take over, higher = source structure preserved.
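One way to picture audio_cover_strength as "the fraction of DiT steps that hear the source context" (a simplified mental model; the real scheduler details in acestep.cpp may differ):

```python
def steps_with_source(total_steps, strength):
    """Simplified model: the first `strength` fraction of DiT steps
    receive the source-audio context; the remaining steps run on the
    text prompt and lyrics alone."""
    n = round(total_steps * strength)
    return list(range(n))
```

With 50 steps and strength 0.5, the first 25 steps are anchored to the source and the last 25 are free to follow the lyrics.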

Cover noise and Cover strength are two independent controls. Cover noise always blends the initial diffusion noise with clean VAE latents from the source, regardless of the cover mode. It anchors the starting point of the diffusion closer to the original audio. I built cover-nofsq so that the first DiT steps also receive clean VAE latents as context, not just the noise init. This constrains the DiT further, which gives better fidelity than Cover noise alone can achieve in standard cover, but it also means very complex music can overconstrain the model. Keep it for tracks with clear structure.
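A minimal sketch of what the Cover noise blend could look like; the exact formula in acestep.cpp may differ, this is plain linear interpolation for intuition only:

```python
import random

def init_latents(source_latents, cover_noise):
    """Blend pure Gaussian noise with clean VAE latents from the source
    to form the diffusion starting point (assumed linear interpolation).
    cover_noise = 0 -> pure noise start; 1 -> start exactly at the source."""
    return [
        (1 - cover_noise) * random.gauss(0, 1) + cover_noise * x
        for x in source_latents
    ]
```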

Also note that Turbo and SFT DiT models react very differently to Cover strength. With Turbo, a strength of 0.5 already stays very close to the source. With SFT models, you need to go much lower to avoid reproducing the source too faithfully and let the text prompt and lyrics actually shape the output.

  • Larger Text-Encoder: not possible with the current architecture. text_hidden_dim=1024 is hardcoded in the DiT config. The CondEncoder projection weights inside every DiT GGUF are sized for exactly 1024-dim input. Supporting a different embedding size would require retraining the DiT from scratch. What you CAN do is use the 4B LM for better lyrics and audio code generation, that works today.

  • cover-nofsq vs cover: heads up, cover-nofsq is my own R&D addition, it does not exist in the original ACE-Step project. I have a PR in preparation for the upstream Python repo but haven't had time to test it thoroughly yet, so if anyone with a Python instance wants to help validate it, reach out!

The difference is about source audio latent quality, not the text encoder. Both use the same 0.6B embedding for text conditioning. Standard cover runs a lossy FSQ roundtrip (25Hz -> 5Hz -> 25Hz) on the source latents before feeding them to the DiT, which degrades micro-timings and lets the DiT diverge creatively. cover-nofsq skips that roundtrip and feeds clean VAE latents directly, so the DiT stays much closer to the original structure. Think of it as: cover = reinterpretation, cover-nofsq = close remix.
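A toy illustration of why that roundtrip is lossy. This is generic scalar quantization at a coarser rate, not ACE-Step's actual FSQ codebook, but it shows the same effect: downsampling plus quantization smears micro-timings:

```python
def fsq_roundtrip(latents, levels=8, stride=5):
    """Toy lossy roundtrip: downsample 25Hz -> 5Hz (keep every 5th frame),
    quantize each value to `levels` bins in [-1, 1], then upsample back
    by nearest-neighbour repetition."""
    down = latents[::stride]
    step = 2 / (levels - 1)
    quant = [round((x + 1) / step) * step - 1 for x in down]
    up = []
    for q in quant:
        q = max(-1.0, min(1.0, q))  # clamp to the quantizer's range
        up.extend([q] * stride)
    return up[:len(latents)]
```

Feed a smooth ramp through it and the output comes back staircased: same length, visibly different values. That degradation is exactly the creative slack standard cover gives the DiT, and exactly what cover-nofsq removes.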

  • The "0 Text-Enc" error: the model registry classifies GGUFs by their general.architecture metadata and only recognizes Qwen3-Embedding-0.6B as acestep-text-enc. A larger embedding model would have a different hidden size that does not match the DiT's baked-in 1024-dim projection, so it cannot be supported without retraining the DiT. Just make sure you have the 0.6B Qwen3-Embedding GGUF in your --models directory and it will be detected.

Hope that clears things up!

It's very close to perfect, allowing complete replacement of the lyrics with new ones, or even a different language, while maintaining excellent alignment. This method significantly outperforms the other cover methods. I'm quite impressed with the SDE (Stochastic) method.

The CFG for the SFT model is also incredible: use values from 1 to 7, not much more. It helps rein in the model's tendency toward excessive detail, bringing it closer to Turbo (text2music task). It also lets you structure and enforce the prompt style in the cover-nofsq task (with a very low strength between 0.01 and 0.3, since a few steps are enough to condition the SFT model).
