DFlash Improvement Ideas: Higher Acceptance Length Without Training
Goal: Improve DFlash's acceptance length (tau) and acceleration ratio using only inference-time modifications — no additional training.
Baseline: Qwen3-4B + z-lab/Qwen3-4B-DFlash-b16, block_size=16, math500 (10 samples, 512 tokens)
- Baseline avg tau = 8.63, median = 8.0
Idea 1: Iterative Block Refinement (Multi-Step Denoising)⭐⭐⭐⭐⭐
Core Idea: Run the DFlash draft model multiple times on the same block. After each pass, use the sampled tokens as updated noise embeddings for the next pass, mimicking multi-step diffusion denoising.
Why it might work: DFlash currently uses a single forward pass to predict all block tokens from mask tokens. The initial mask embeddings carry no information about what the draft should generate. By iterating, each pass conditions on an increasingly informed noise context — the first pass gives a rough draft, the second pass refines it with better token embeddings as context.
Implementation complexity: Low. Just loop the draft forward pass 2-3 times, feeding output back as input. No KV cache across steps.
Expected improvement: +0.5 to +2.0 tau (denoising is the core mechanism of diffusion models — more steps should help).
Risk: Extra draft compute may negate speedup gains. Must keep step count low (2-3) to maintain wall-clock advantage.
Pilot result: [PENDING]
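A minimal sketch of the refinement loop, assuming a `draft_forward` hook that maps a block of noise embeddings to per-position logits and an `embed` token-embedding lookup (both hypothetical stand-ins for the DFlash internals, not its real API):

```python
import torch

def iterative_block_refine(draft_forward, embed, mask_emb, block_size, num_steps=2):
    # draft_forward: (block_size, hidden) noise -> (block_size, vocab) logits
    # embed: token ids -> embeddings. Both are illustrative stand-ins.
    noise = mask_emb.expand(block_size, -1).clone()  # pass 1 starts from mask tokens
    tokens = None
    for _ in range(num_steps):
        logits = draft_forward(noise)
        tokens = logits.argmax(dim=-1)  # greedy draft for this pass
        noise = embed(tokens)           # refined noise context for the next pass
    return tokens
```

No KV cache is carried across steps, matching the note above; the only state passed between passes is the re-embedded token block.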
Idea 1 plus: Confidence-Gated Selective Redrafting
Core Idea: After the first draft pass, compute per-position entropy of the draft logits. If any position (especially early ones) has high entropy (>threshold), run a second draft pass with the partially-filled block as context. Only replace the high-entropy positions with the second pass's predictions.
Why it might work: High entropy at a position signals that the draft model is uncertain — these are the positions most likely to cause rejection. A second pass, now conditioned on a partially-correct draft, can refine exactly these problematic positions.
Implementation complexity: Medium. Two draft passes + entropy computation + selective replacement.
Expected improvement: +0.5 to +2.0 tau (targeted improvement where it matters most).
Risk: Extra compute for the second pass. Entropy threshold needs tuning per dataset/model.
Pilot result: `[PENDING]
Idea 2: N-Best Draft Proposals (Multi-Candidate Selection)⭐
Core Idea: Generate K candidate draft blocks (K=2-4) using different sampling strategies (greedy + temperature-based), then select the candidate with the highest aggregate log-probability under the draft model's own distribution.
Why it might work: Exact-match acceptance is binary — a single wrong token kills the entire suffix. By generating multiple candidates and picking the most confident one, we increase the probability that at least one candidate matches the target's greedy output. The confidence score acts as a proxy for "likely to match target."
Implementation complexity: Low-Medium. K forward passes per block, simple confidence scoring.
Expected improvement: +0.5 to +2.5 tau (especially for "unlucky" blocks where the default greedy choice is wrong).
Risk: K times the draft compute cost. Must keep K small. Confidence score may not perfectly correlate with acceptance.
Pilot result: [PENDING]
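A sketch of the candidate selection, assuming a `draft_forward` hook returning per-position logits. Note that when all candidates share one set of logits, the greedy candidate maximizes the log-prob score by construction; in the full method each candidate would come from its own forward pass (hence "K forward passes per block" above):

```python
import torch

def n_best_draft(draft_forward, noise, k=3, temperature=0.7):
    # Candidate 0 is greedy; candidates 1..k-1 are temperature samples.
    # Each candidate is scored by its total log-prob under the draft's
    # own distribution, as a confidence proxy for acceptance.
    logits = draft_forward(noise)               # (block, vocab)
    logp = torch.log_softmax(logits, dim=-1)
    best_tokens, best_score = None, float("-inf")
    for i in range(k):
        if i == 0:
            cand = logits.argmax(dim=-1)
        else:
            probs = torch.softmax(logits / temperature, dim=-1)
            cand = torch.multinomial(probs, 1).squeeze(-1)
        score = logp.gather(-1, cand.unsqueeze(-1)).sum().item()
        if score > best_score:
            best_tokens, best_score = cand, score
    return best_tokens, best_score
```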
Idea 6: Token Recycling / Warm-Start Drafting⭐⭐⭐
Core Idea: When rejection occurs at position j in a block of B tokens, the rejected tokens at positions j+1..B are discarded. Instead, save these tokens and use them to warm-start the noise embeddings of the next draft block. This gives the draft model a better starting point than random mask tokens.
Why it might work: Even though the prefix was wrong, later tokens in the rejected draft may still carry useful distributional information about the continuation. Using them as initial noise (instead of mask tokens) gives the draft model more context for its single-pass prediction.
Implementation complexity: Low. Save rejected suffix, inject into next block's initial embeddings.
Expected improvement: +0.3 to +1.0 tau (modest, since the recycled tokens are conditioned on a wrong prefix).
Risk: Recycled tokens may actually mislead the draft model if they were generated from a very different prefix. Net effect could be negative.
Pilot result: [PENDING]
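The warm-start injection can be sketched as follows, with `embed` and `mask_emb` again as hypothetical stand-ins for the DFlash embedding path:

```python
import torch

def recycle_suffix(rejected_tokens, reject_pos, embed, mask_emb, block_size):
    # Build the next block's initial noise embeddings from the suffix of
    # the previous (rejected) draft. Positions beyond the recycled suffix
    # fall back to the mask embedding.
    noise = mask_emb.expand(block_size, -1).clone()
    suffix = rejected_tokens[reject_pos + 1:]   # tokens discarded by verification
    n = min(len(suffix), block_size)
    if n > 0:
        noise[:n] = embed(suffix[:n])           # warm-start with recycled embeddings
    return noise
```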
Idea 9: Dynamic Target Layer Selection
Core Idea: Instead of always extracting features from the same 5 fixed target layers, try alternative layer selections (e.g., shifted by +2 or -2) and pick the one that produces the highest-confidence draft. Different parts of the sequence may benefit from different layers.
Why it might work: The paper's ablation (Table 5) shows that layer selection affects acceptance length. The optimal layers may vary by position in the sequence or by the type of content being generated. Late layers have more "final answer" information; early layers have more syntactic/structural information.
Implementation complexity: Medium. Multiple draft passes with different layer configs + scoring.
Expected improvement: +0.3 to +1.5 tau (if the fixed layers are suboptimal for certain content types).
Risk: The draft model's fc projection was trained on specific layer combinations. Using different layers degrades the learned alignment. Needs the fc layer to generalize.
Pilot result: [PENDING]
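A sketch of the selection logic, assuming a `draft_with_layers` hook that re-extracts target features for a given layer list and returns draft logits (this hook is an assumption; it is not part of the DFlash API):

```python
import torch

def pick_layer_config(draft_with_layers, layer_sets):
    # Re-run the draft with each candidate layer selection and keep the
    # config whose predictions are most confident (highest mean top-1
    # probability across positions).
    best_conf, best_layers, best_tokens = -1.0, None, None
    for layers in layer_sets:
        logits = draft_with_layers(layers)
        conf = torch.softmax(logits, dim=-1).max(dim=-1).values.mean().item()
        if conf > best_conf:
            best_conf, best_layers, best_tokens = conf, layers, logits.argmax(dim=-1)
    return best_layers, best_tokens, best_conf
```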
Idea 11: Top-K Constrained Draft Sampling⭐
Core Idea: Apply top-k filtering to draft logits before sampling, zeroing out all but the top-k tokens at each position. This forces the draft to choose among only the most probable tokens.
Why it might work: For exact-match acceptance under greedy target decoding, only the target's argmax token matters. By restricting the draft's vocabulary to its own top-k, we reduce the chance of sampling a low-probability token that definitely won't match the target.
Implementation complexity: Very low. Single top-k operation on logits.
Expected improvement: +0.1 to +0.5 tau (minor, since greedy draft already picks argmax; mainly helps with stochastic target).
Risk: Under greedy draft + greedy target, this is a no-op. Only helps when draft uses non-zero temperature.
Pilot result: [PENDING]
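The filtering itself is a one-liner over the draft logits; a minimal sketch:

```python
import torch

def topk_filter(logits, k=10):
    # Keep only the top-k logits per position; everything else goes to
    # -inf so it can never be sampled. Under greedy drafting this is a
    # no-op, as the Risk note says; it only changes temperature sampling.
    kth = logits.topk(k, dim=-1).values[..., -1:]   # k-th largest per position
    return logits.masked_fill(logits < kth, float("-inf"))
```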
Idea 12: Position-Weighted Logit Scaling⭐⭐
Core Idea: Scale draft logits by a position-dependent factor: early positions get more aggressive scaling (sharper distribution = higher confidence), later positions get gentler scaling. Rationale: early positions matter most for prefix-based acceptance.
Why it might work: By sharpening early positions, we increase the probability that positions 1-3 are correct (the most critical for tau). Later positions can afford to be less sharp since they only matter if all earlier positions are accepted.
Implementation complexity: Very low. Multiply logits by a position-dependent vector.
Expected improvement: +0.2 to +1.0 tau.
Risk: Over-sharpening may concentrate probability on a wrong token. Needs careful calibration of the scaling schedule.
Pilot result: [PENDING]
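A minimal sketch of the scaling schedule (the linear decay and the alpha/beta endpoints are illustrative choices, not calibrated values):

```python
import torch

def position_scaled_logits(logits, alpha=1.5, beta=1.0):
    # Multiply each position's logits by a factor decaying linearly from
    # alpha (position 0, sharpest) to beta (last position, gentlest).
    block_size = logits.shape[0]
    scale = torch.linspace(alpha, beta, block_size).unsqueeze(-1)  # (block, 1)
    return logits * scale
```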
Bonus Ideas (Not Yet Implemented)
Idea 13: Tree-Structured Verification
Verify multiple candidate continuations in a single batched target forward pass using packed attention with tree causal masks. This doesn't improve tau per-candidate but amortizes the verification cost across candidates, enabling higher effective throughput. Very promising for combining with N-best or beam approaches.
Idea 16: Draft-Target KL Alignment via Inference-Time Calibration⭐⭐⭐
Compute a lightweight calibration mapping (affine transform on draft logits) by running a small calibration set and measuring draft vs target token agreement. Apply this calibration at inference time without retraining.
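Since this one is not yet implemented, here is only a sketch of what the calibration fit could look like: a per-vocab affine transform on draft logits trained by cross-entropy against the target's greedy tokens on a small calibration set (shapes, the per-vocab parameterization, and the optimizer choice are all assumptions):

```python
import torch
import torch.nn.functional as F

def fit_affine_calibration(draft_logits, target_tokens, steps=200, lr=0.1):
    # Fit per-vocab scale a and bias b so that a * logits + b better
    # predicts the target's greedy tokens; apply a * logits + b at
    # inference time with no retraining of the draft itself.
    vocab = draft_logits.shape[-1]
    a = torch.ones(vocab, requires_grad=True)
    b = torch.zeros(vocab, requires_grad=True)
    opt = torch.optim.Adam([a, b], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(a * draft_logits + b, target_tokens)
        loss.backward()
        opt.step()
    return a.detach(), b.detach()
```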
Idea 17: Multi-Block Pipelining
Overlap the draft and verification phases across blocks. While the target model verifies block k, the draft model starts working on block k+1 using a speculative target_hidden extrapolation. If the speculation was right, the pipeline stays full.
Experiment Configuration
| Parameter | Value |
|---|---|
| Target model | Qwen/Qwen3-4B |
| Draft model | z-lab/Qwen3-4B-DFlash-b16 |
| Block size | 16 |
| Dataset | math500 |
| Max samples | 10 |
| Max new tokens | 512 |
| Temperature | 0.0 (greedy) |
| GPU | NVIDIA H200 (single GPU) |
| Attention | SDPA |
Results Summary
| # | Method | Avg tau | Delta | Pilot Signal |
|---|---|---|---|---|
| 0 | Baseline | 8.63 | - | - |
| 1 | Iterative Refinement (2 steps) | [PENDING] | | |
| 2 | Iterative Refinement (3 steps) | [PENDING] | | |
| 3 | N-Best Draft (K=2) | [PENDING] | | |
| 4 | N-Best Draft (K=3) | [PENDING] | | |
| 5 | Adaptive Block Size (4-16) | [PENDING] | | |
| 6 | Early-Position Beam (width=3) | [PENDING] | | |
| 7 | Draft Temp t=0.3 | [PENDING] | | |
| 8 | Draft Temp t=0.1 | [PENDING] | | |
| 9 | Token Recycling | [PENDING] | | |
| 10 | Selective Redraft (ent>1.5) | [PENDING] | | |
| 11 | Selective Redraft (ent>1.0) | [PENDING] | | |
| 12 | Majority Vote (K=3) | [PENDING] | | |
| 13 | Majority Vote (K=5) | [PENDING] | | |
| 14 | Shifted Target Layers (+2) | [PENDING] | | |
| 15 | Logit Averaging (2 pass) | [PENDING] | | |
| 16 | Logit Averaging (3 pass) | [PENDING] | | |
| 17 | Top-K Constrained (k=10) | [PENDING] | | |
| 18 | Position-Weighted Temp | [PENDING] | | |
Generated 2026-04-01. Experiments running on NVIDIA H200, dflash conda env.