The Problem: MTP Head Silently Dropped at Load Time

The starting point for this work was a subtle but consequential behavior in the pasta-paul DeepSeek-V4-Flash-W4A16-FP8 quantization. When loaded through Hugging Face Transformers, the model’s Multi-Token Prediction (MTP) block is excluded via the _keys_to_ignore_on_load_unexpected mechanism, which causes the framework to silently discard the MTP weights rather than raise an error [1]. The practical consequence is that any vLLM invocation using --speculative-config '{"method":"mtp",...}' runs as a no-op: the speculative decoding path is configured, but the head it depends on is absent, so decode throughput stays at the baseline 52.85 tok/s [1].
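To make the silent drop concrete, here is a simplified model of how a regex ignore list can suppress "unexpected key" reporting at load time. It is a sketch of the general mechanism, not Transformers' actual loading code, and the pattern and tensor names are hypothetical placeholders rather than the real checkpoint's keys.

```python
import re

# Hypothetical ignore pattern and key names, for illustration only.
# The real checkpoint's tensor names may differ.
IGNORE_ON_LOAD_UNEXPECTED = [r"mtp\."]


def split_unexpected_keys(checkpoint_keys, model_keys, ignore_patterns):
    """Split keys the model has no slot for into (reported, silently_dropped)."""
    unexpected = [k for k in checkpoint_keys if k not in model_keys]
    dropped = [
        k for k in unexpected
        if any(re.search(p, k) for p in ignore_patterns)
    ]
    reported = [k for k in unexpected if k not in dropped]
    return reported, dropped


checkpoint_keys = [
    "model.layers.0.self_attn.q_proj.weight",
    "mtp.0.experts.0.w1.weight",
    "mtp.0.norm.weight",
]
model_keys = {"model.layers.0.self_attn.q_proj.weight"}

reported, dropped = split_unexpected_keys(
    checkpoint_keys, model_keys, IGNORE_ON_LOAD_UNEXPECTED
)
print("warned about:", reported)
print("silently dropped:", dropped)
```

In this toy run, both MTP tensors land in the silently dropped list and nothing is reported, which mirrors the behavior described above: the load succeeds without any sign that weights went missing.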

The Fix: Retrofitting and Re-Quantizing the MTP Block

A community researcher identified the dropped weights and rebuilt the MTP block from scratch. The retrofitting process involved running a GPTQ pass on the MTP block’s routed experts to match the base model’s W4A16 INT4 group format, using Frantar-style calibration with Cholesky H inverse computation [1]. Calibration data consisted of 256 ultrachat_200k prompts at 256 max tokens, captured from the running pasta-paul model and producing 17,701 MTP forward dumps covering 473k tokens [1]. After the quantization pass, vLLM itself required patching to recognize the restored MTP head and route inference through it correctly [1]. The resulting model is published on Hugging Face at LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 [1].
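The GPTQ pass described above can be sketched in a few lines of NumPy. This is a minimal illustration of Frantar-style column-by-column quantization with a Cholesky factor of the inverse Hessian, not the author's script. The damping factor, the factor of 2 in the Hessian, the max-abs scale rule, and the function name are assumptions borrowed from common GPTQ implementations, and the demo uses a small group size so it runs quickly.

```python
import numpy as np


def gptq_quantize(W, X, group_size=128, damp=0.01):
    """Quantize W (out, in) to symmetric INT4 groups using calibration X (samples, in).

    Returns (codes, scales): int8 codes in [-8, 7] and one scale per row per group.
    """
    W = W.astype(np.float64).copy()
    n_out, n_in = W.shape
    assert n_in % group_size == 0, "sketch assumes in-features divisible by group size"

    # Layer Hessian from calibration activations, with diagonal damping (assumed 1%).
    H = 2.0 * X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(n_in)

    # Upper Cholesky factor U of the inverse Hessian: H^-1 = U.T @ U.
    U = np.linalg.cholesky(np.linalg.inv(H)).T

    codes = np.zeros((n_out, n_in), dtype=np.int8)
    scales = np.zeros((n_out, n_in // group_size))
    scale = None
    for i in range(n_in):
        if i % group_size == 0:
            # New group: derive a symmetric scale from the current (updated) weights.
            block = W[:, i:i + group_size]
            scale = np.maximum(np.abs(block).max(axis=1), 1e-8) / 7.0
            scales[:, i // group_size] = scale
        q_int = np.clip(np.round(W[:, i] / scale), -8, 7)
        codes[:, i] = q_int.astype(np.int8)
        # Push the quantization error of column i onto the remaining columns.
        err = (W[:, i] - q_int * scale) / U[i, i]
        W[:, i:] -= np.outer(err, U[i, i:])
    return codes, scales


rng = np.random.default_rng(0)
W_demo = rng.normal(size=(8, 32))
X_demo = rng.normal(size=(64, 32))
codes, scales = gptq_quantize(W_demo, X_demo, group_size=16)
print(codes.shape, scales.shape)
```

In the real pipeline, X would come from the captured MTP forward dumps rather than random data, and the codes would then be packed into the compressed-tensors layout the base model uses.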

Performance Numbers Across Context Lengths and Stream Counts

Benchmarks were run on two RTX PRO 6000 Blackwell Max-Q GPUs (sm_120 architecture), each with 96 GB of memory and connected without NVLink [1]. The table below compares the no-MTP base model with the retrofitted model at two context-length settings.

Profile                          | Decode TPS (tok/s) | TTFT    | Delta vs. base
pasta-paul base, no MTP, 524k    | 52.85              | 91 ms   | reference
This model, 524k 2-stream        | 85.52              | 155 ms  | +62% (1.62x)
This model, 128k single-stream   | ~111               | ~310 ms | +110% (2.10x)

At 524k context with two concurrent streams, MTP self-speculation raises decode throughput from 52.85 tok/s to 85.52 tok/s, a 62 percent improvement [1]. At 128k context in a single-stream configuration, throughput reaches approximately 111 tok/s, more than doubling the no-MTP baseline [1]. The trade-off is a higher time-to-first-token: 155 ms at 524k versus the base model’s 91 ms, and approximately 310 ms at 128k single-stream [1].
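The ratios above, and a rough sense of when the higher TTFT pays for itself, can be checked with a short calculation. The break-even model here is an assumption for illustration: it treats total response time as TTFT plus output tokens divided by decode rate, and it treats the reported 2-stream figure as if it were a per-request rate, which the source does not confirm.

```python
def speedup(base_tps, new_tps):
    """Throughput ratio of the new configuration over the base."""
    return new_tps / base_tps


def break_even_tokens(base_ttft_s, base_tps, new_ttft_s, new_tps):
    """Output length at which both configs finish together, assuming
    total_time = ttft + n_tokens / tps (a simplification)."""
    per_token_saving = 1.0 / base_tps - 1.0 / new_tps
    return (new_ttft_s - base_ttft_s) / per_token_saving


# Figures from the benchmark table; the 128k value is approximate in the source.
BASE_TPS, BASE_TTFT_S = 52.85, 0.091
MTP_524K_TPS, MTP_524K_TTFT_S = 85.52, 0.155
MTP_128K_TPS = 111.0

s_524k = speedup(BASE_TPS, MTP_524K_TPS)
s_128k = speedup(BASE_TPS, MTP_128K_TPS)
n_even = break_even_tokens(BASE_TTFT_S, BASE_TPS, MTP_524K_TTFT_S, MTP_524K_TPS)

print(f"524k 2-stream speedup: {s_524k:.2f}x")
print(f"128k single-stream speedup: {s_128k:.2f}x")
print(f"524k break-even output length: {n_even:.1f} tokens")
```

Under these assumptions the extra 64 ms of TTFT at 524k is recovered within roughly the first ten generated tokens, so the latency penalty mainly matters for very short responses.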

Quantization Choices and Hardware Fit

The 671B total parameter model (32B active parameters) distributes across the two 96 GB GPUs without requiring NVLink [1]. Tensor treatment varied by layer type. The 768 routed-expert tensors (256 experts multiplied across the w1, w2, and w3 weight matrices) received W4A16 INT4 group-128 symmetric quantization via the GPTQ procedure described above [1]. Five attention projection tensors were kept in FP8 block format, with the only modification being a rename from scale to weight_scale to match pasta-paul’s compressed-tensors naming convention [1]. Shared experts, e_proj, h_proj, norms, the gate, and attention sink tensors were left in BF16 or FP32 [1].
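The scale-to-weight_scale rename for the FP8 attention projections amounts to a key rewrite over the tensor dictionary. The sketch below shows one way to do it; the tensor names are hypothetical, and only the suffix change itself comes from the source.

```python
def rename_fp8_scales(tensors):
    """Rename '<prefix>.scale' keys to '<prefix>.weight_scale'; leave other keys as-is."""
    renamed = {}
    for name, value in tensors.items():
        if name.endswith(".scale"):
            name = name[: -len(".scale")] + ".weight_scale"
        renamed[name] = value
    return renamed


# Hypothetical tensor names, for illustration only.
example = {
    "mtp.0.attn.wq_a.weight": "fp8 block data",
    "mtp.0.attn.wq_a.scale": "fp8 block scales",
    "mtp.0.norm.weight": "bf16 data",
}
converted = rename_fp8_scales(example)
print(sorted(converted))

# Sanity check on the routed-expert tensor count quoted above.
n_routed_tensors = 256 * 3  # 256 experts, each with w1, w2, w3
print(n_routed_tensors)
```

Matching on the exact ".scale" suffix keeps the function safe to run twice, since an already renamed "weight_scale" key no longer matches.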

Accuracy Checks on the Modified Model

Small-sample sanity checks were run across three benchmarks to verify that the quantization and MTP retrofitting did not degrade task performance relative to the base model [1].

Benchmark                                    | n   | Score
GSM8K (T=0, CoT, exact-match)                | 100 | 93%
MMLU (mixed subjects)                        | 100 | 53%
HumanEval (syntactic check, not pass@1 exec) | 50  | 90%

The GSM8K result of 93 percent and the HumanEval syntactic pass rate of 90 percent are presented as consistent with the base model’s capability level [1]. The MMLU result of 53 percent on a 100-sample draw is noted as being dragged down by harder subjects in the sample, and the author states it tracks the base model’s performance [1]. Full benchmark data is available in the model card on Hugging Face [1].
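To see why runs of 50 to 100 items are best read as sanity checks, it helps to put a confidence interval around each score. The sketch below uses the Wilson score interval at roughly 95 percent confidence; the choice of interval is mine rather than the source's, and the success counts are derived from the reported percentages and sample sizes.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Counts derived from the reported scores: 93% of 100, 53% of 100, 90% of 50.
runs = [
    ("GSM8K", 93, 100),
    ("MMLU", 53, 100),
    ("HumanEval (syntactic)", 45, 50),
]
for name, successes, n in runs:
    lo, hi = wilson_interval(successes, n)
    print(f"{name}: {successes / n:.0%} (interval {lo:.1%} to {hi:.1%})")

mmlu_lo, mmlu_hi = wilson_interval(53, 100)
```

For the MMLU run, the interval spans roughly ten percentage points on either side of 53 percent, so a sample of this size cannot distinguish a small regression from sampling noise.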

Availability and Practical Scope

The model is available at huggingface.co/LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 [1]. Reproducing the reported throughput numbers requires enough combined VRAM to hold the full 671B parameter model. The reference configuration is two RTX PRO 6000 Blackwell Max-Q GPUs at 96 GB each, without NVLink [1]. The source notes that Max-Q specific fixes were applied to the setup, though the details of those fixes were truncated in the available source material [1].

FAQ

Q. Why does the base pasta-paul quant not support MTP speculative decoding out of the box? Hugging Face Transformers lists the MTP block in _keys_to_ignore_on_load_unexpected, causing the weights to be silently discarded at load time rather than loaded into memory [1]. Any vLLM speculative config targeting the MTP method therefore has no head to speculate with.

Q. Does restoring the MTP head increase time-to-first-token? Yes. At 524k context, TTFT rises from 91 ms in the no-MTP base to 155 ms in the patched model [1]. At 128k single-stream, TTFT is approximately 310 ms [1]. Operators should weigh this latency increase against the decode throughput gains for their specific workloads.

Q. What quantization format does the MTP block use, and does it match the rest of the model? The retrofitted MTP routed-expert tensors use W4A16 INT4 group-128 symmetric quantization, matching the format of the base model’s routed experts [1]. Attention projections use FP8 block format consistent with the upstream pasta-paul convention [1].

Q. Can this model run on hardware without NVLink? Yes. The benchmarks were produced on two RTX PRO 6000 Max-Q GPUs with no NVLink connection [1]. The 671B total parameter model fits within the combined 192 GB of VRAM across the two cards [1].

Q. Are the accuracy benchmarks sufficient to confirm production-quality output? The author describes the GSM8K, MMLU, and HumanEval runs as sanity checks on small samples (50 to 100 items), not full evaluations [1]. Full benchmark data is available in the model card, and operators requiring rigorous accuracy validation should consult that data and run their own evaluations.

Key takeaways

  • The MTP head in pasta-paul’s DeepSeek-V4-Flash quant is silently dropped by Hugging Face Transformers at load time, making vLLM’s MTP speculative decoding a no-op without intervention [1].
  • Retrofitting the MTP block with a GPTQ W4A16 INT4 pass and patching vLLM raises decode throughput from 52.85 tok/s to 85.52 tok/s at 524k context, a 62 percent improvement [1].
  • At 128k single-stream, the patched model reaches approximately 111 tok/s, more than doubling the no-MTP baseline [1].
  • The 671B parameter model fits on two 96 GB RTX PRO 6000 Max-Q GPUs without NVLink, using a mixed W4A16 INT4 and FP8 quantization scheme [1].
  • Small-sample accuracy checks show 93 percent on GSM8K and 90 percent on HumanEval syntactic evaluation, with MMLU tracking the base model [1].