# DeepSeek-V4-Flash Quant Restores MTP Head for 62% Throughput Gain

> A community researcher has published a modified quantization of DeepSeek-V4-Flash that restores a stripped Multi-Token Prediction head and patches vLLM to enable MTP self-speculation, lifting decode throughput from 52.85 tok/s to 85.52 tok/s at 524k context on two RTX PRO 6000 Max-Q GPUs with no NVLink.

- Canonical URL: https://agentry.press/news/deepseek-v4-flash-quant-restores-mtp-head-for-62-throughput-gain/
- Type: News
- Published: 2026-06-07
- By: agentry
- Tags: deepseek, quantization, vllm, speculative-decoding, local-inference, mtp

---

## The Problem: MTP Head Silently Dropped at Load Time

This work began with a subtle but consequential behavior in the pasta-paul `DeepSeek-V4-Flash-W4A16-FP8` quantization. When the model is loaded through Hugging Face Transformers, its Multi-Token Prediction (MTP) block is excluded via the `_keys_to_ignore_on_load_unexpected` mechanism, so the framework silently discards the MTP weights instead of raising an error [1]. As a result, any vLLM invocation using `--speculative-config '{"method":"mtp",...}'` is a no-op: the speculative decoding path is configured, but the head it depends on is absent, and decode throughput stays at the baseline 52.85 tok/s [1].
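The silent-drop behavior can be illustrated with a minimal sketch of how ignore-on-unexpected filtering works in principle. This is a simplified illustration, not the Transformers implementation; the tensor key names and the `^mtp\.` pattern are hypothetical placeholders rather than names taken from the pasta-paul checkpoint.

```python
import re

def split_unexpected_keys(checkpoint_keys, model_keys, ignore_patterns):
    """Split unexpected checkpoint keys into (reported, silently_dropped).

    An unexpected key that matches any ignore pattern is discarded with no
    warning, which is the general idea behind
    `_keys_to_ignore_on_load_unexpected`.
    """
    unexpected = [k for k in checkpoint_keys if k not in model_keys]
    dropped = [k for k in unexpected
               if any(re.search(p, k) for p in ignore_patterns)]
    reported = [k for k in unexpected if k not in dropped]
    return reported, dropped

# Hypothetical key names and pattern, for illustration only.
checkpoint_keys = [
    "model.layers.0.attn.wq.weight",
    "mtp.0.ffn.experts.0.w1.weight",
    "mtp.0.e_proj.weight",
]
model_keys = {"model.layers.0.attn.wq.weight"}
ignore_patterns = [r"^mtp\."]

reported, dropped = split_unexpected_keys(checkpoint_keys, model_keys, ignore_patterns)
print("reported:", reported)
print("silently dropped:", dropped)
```

Running a check like this against a checkpoint's key list is one way to notice that weights exist on disk but never reach the model.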

## The Fix: Retrofitting and Re-Quantizing the MTP Block

A community researcher identified the dropped weights and re-quantized the MTP block so it could be loaded alongside the base model. The retrofit ran a GPTQ pass over the MTP block's routed experts to match the base model's W4A16 INT4 group format, using Frantar-style calibration with the inverse Hessian computed via Cholesky decomposition [1]. Calibration data consisted of 256 ultrachat_200k prompts at 256 max tokens, captured from the running pasta-paul model, which produced 17,701 MTP forward dumps covering 473k tokens [1]. After the quantization pass, vLLM itself had to be patched to recognize the restored MTP head and route inference through it correctly [1]. The resulting model is published on Hugging Face at `LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8` [1].
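For readers unfamiliar with GPTQ, the following toy sketch shows the core idea of the pass described above: quantize one column at a time and push each rounding error onto the not-yet-quantized columns, using the upper Cholesky factor of the inverse Hessian. It is a pure-Python illustration under stated assumptions, not the author's pipeline. The function names, the tiny 4-column example, and the damping value are all illustrative, and a group size of 4 stands in for the 128 used in practice.

```python
import math

def cholesky(a):
    """Lower-triangular factor L of an SPD matrix, so that a = L L^T."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    return l

def inverse_spd(a):
    """Inverse of an SPD matrix via its Cholesky factor: a^-1 = L^-T L^-1."""
    n = len(a)
    l = cholesky(a)
    linv = [[0.0] * n for _ in range(n)]
    for c in range(n):  # forward substitution, one column of L^-1 at a time
        for r in range(c, n):
            s = (1.0 if r == c else 0.0) - sum(l[r][k] * linv[k][c] for k in range(c, r))
            linv[r][c] = s / l[r][r]
    return [[sum(linv[k][i] * linv[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def hessian(inputs, damp=0.01):
    """H = 2 * sum(x x^T) over calibration inputs, plus diagonal damping."""
    n = len(inputs[0])
    h = [[2.0 * sum(x[i] * x[j] for x in inputs) for j in range(n)] for i in range(n)]
    mean_diag = sum(h[i][i] for i in range(n)) / n
    for i in range(n):
        h[i][i] += damp * mean_diag
    return h

def rtn_row(w, group_size=128, bits=4):
    """Round-to-nearest baseline with symmetric per-group scales."""
    qmax = 2 ** (bits - 1) - 1
    q = []
    for g in range(0, len(w), group_size):
        chunk = w[g:g + group_size]
        scale = max(abs(x) for x in chunk) / qmax or 1.0
        q += [max(-qmax - 1, min(qmax, round(x / scale))) * scale for x in chunk]
    return q

def gptq_row(w, h, group_size=128, bits=4):
    """Quantize one weight row with GPTQ-style error compensation."""
    n = len(w)
    c = cholesky(inverse_spd(h))
    u = [[c[j][i] for j in range(n)] for i in range(n)]  # upper factor: H^-1 = u^T u
    w, q = list(w), [0.0] * n
    qmax = 2 ** (bits - 1) - 1
    scale = 1.0
    for i in range(n):
        if i % group_size == 0:  # new group: scale from the current weights
            scale = max(abs(x) for x in w[i:i + group_size]) / qmax or 1.0
        q[i] = max(-qmax - 1, min(qmax, round(w[i] / scale))) * scale
        err = (w[i] - q[i]) / u[i][i]
        for j in range(i + 1, n):  # compensate on the remaining columns
            w[j] -= err * u[i][j]
    return q

# Toy calibration inputs and weights (made up for illustration).
inputs = [[1.0, 0.9, 0.1, 0.0], [0.8, 1.0, 0.0, 0.2],
          [0.1, 0.0, 1.0, 0.9], [0.0, 0.2, 0.9, 1.0]]
w = [0.31, -0.52, 0.77, 0.12]

def output_error(q):
    """Squared error of the layer output over the calibration inputs."""
    return sum(sum((a - b) * xi for a, b, xi in zip(w, q, x)) ** 2 for x in inputs)

print("RTN output error: ", output_error(rtn_row(w, group_size=4)))
print("GPTQ output error:", output_error(gptq_row(w, hessian(inputs), group_size=4)))
```

With an identity Hessian the compensation terms vanish and `gptq_row` reduces to plain round-to-nearest, which is a convenient sanity check for any implementation of this kind.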

## Performance Numbers Across Context Lengths and Stream Counts

Benchmarks were run on two RTX PRO 6000 Blackwell Max-Q GPUs (sm_120 architecture), each with 96 GB of memory and no NVLink between them [1]. The table below compares the no-MTP baseline with two MTP configurations.

| Profile | Decode TPS | TTFT | Delta vs. base |
|---|---|---|---|
| pasta-paul base, no MTP, 524k | 52.85 | 91 ms | reference |
| This model, 524k 2-stream | 85.52 | 155 ms | +62% (1.62x) |
| This model, 128k single-stream | ~111 | ~310 ms | +110% (2.10x) |

At 524k context with two concurrent streams, MTP self-speculation raises decode throughput from 52.85 tok/s to 85.52 tok/s, a 62 percent improvement [1]. At 128k context with a single stream, throughput reaches approximately 111 tok/s, about 2.1 times the no-MTP baseline [1]; because that baseline was measured at 524k context, this second comparison also reflects the shorter context. The trade-off is a higher time-to-first-token (TTFT): 155 ms at 524k versus the base model's 91 ms, and approximately 310 ms at 128k single-stream [1].
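The reported ratios, and the point at which the faster decode pays back the higher TTFT, can be checked with a few lines of arithmetic. This sketch assumes total request latency is simply TTFT plus output tokens divided by decode rate, and it treats the reported tok/s figures as per-request rates; the source does not say whether the 2-stream figure is aggregate, so the break-even estimate is illustrative only.

```python
import math

def speedup(tps_new, tps_base):
    """Decode throughput ratio of the new configuration over the baseline."""
    return tps_new / tps_base

def break_even_tokens(tps_base, ttft_base_ms, tps_new, ttft_new_ms):
    """Smallest output length at which TTFT + tokens / TPS for the faster
    decoder is no worse than for the baseline (simple additive model)."""
    saved_per_token_ms = 1000.0 / tps_base - 1000.0 / tps_new
    extra_ttft_ms = ttft_new_ms - ttft_base_ms
    return math.ceil(extra_ttft_ms / saved_per_token_ms)

print(f"524k 2-stream speedup:      {speedup(85.52, 52.85):.2f}x")
print(f"128k single-stream speedup: {speedup(111, 52.85):.2f}x")
print("break-even output tokens at 524k:",
      break_even_tokens(52.85, 91, 85.52, 155))
```

Under these assumptions the extra 64 ms of TTFT at 524k is recovered after roughly nine generated tokens, so the latency penalty matters mainly for very short responses.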

## Quantization Choices and Hardware Fit

The 284B total parameter model (13B active parameters) fits across the two 96 GB GPUs without requiring NVLink [1]. Tensor treatment varied by layer type. The 768 routed-expert tensors (256 experts times three weight matrices: w1, w2, and w3) received W4A16 INT4 group-128 symmetric quantization via the GPTQ procedure described above [1]. Five attention projection tensors were kept in FP8 block format; the only modification was renaming `scale` to `weight_scale` to match pasta-paul's compressed-tensors naming convention [1]. Shared experts, e_proj, h_proj, norms, the gate, and attention sink tensors were left in BF16 or FP32 [1].
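The tensor count and the rename step are simple enough to sketch. The key names below are hypothetical, since the source does not list the actual tensor names; only the `scale` to `weight_scale` rename and the 256-by-3 expert layout come from the description above.

```python
def expected_routed_expert_tensors(num_experts=256, matrices=("w1", "w2", "w3")):
    """Number of routed-expert weight tensors: one per expert per matrix."""
    return num_experts * len(matrices)

def rename_fp8_scale(key):
    """Rename a trailing `.scale` to `.weight_scale`; leave other keys alone.

    Keys that already end in `.weight_scale` are untouched, so the rename
    is safe to apply more than once.
    """
    suffix = ".scale"
    if key.endswith(suffix):
        return key[: -len(suffix)] + ".weight_scale"
    return key

# Hypothetical tensor names, for illustration only.
keys = [
    "mtp.0.attn.wq_b.weight",
    "mtp.0.attn.wq_b.scale",
    "mtp.0.norm.weight",
]
renamed = [rename_fp8_scale(k) for k in keys]
print(renamed)
print("routed-expert tensors:", expected_routed_expert_tensors())
```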

## Accuracy Checks on the Modified Model

Small-sample sanity checks were run across three benchmarks to verify that the quantization and MTP retrofitting did not degrade task performance relative to the base model [1].

| Benchmark | n | Score |
|---|---|---|
| GSM8K (T=0, COT, exact-match) | 100 | 93% |
| MMLU (mixed subjects) | 100 | 53% |
| HumanEval (syntactic check, not pass@1 exec) | 50 | 90% |

The GSM8K result of 93 percent and the HumanEval syntactic pass rate of 90 percent are presented as consistent with the base model's capability level [1]. The author attributes the 53 percent MMLU result, from a 100-sample draw, to harder subjects in the sample and states that it tracks the base model's performance [1]. Full benchmark data is available in the model card on Hugging Face [1].
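To get a feel for how much uncertainty samples of this size carry, one option is a Wilson score interval on each pass rate. This is an illustrative calculation added here, not something the source reports; the success counts are back-computed from the percentages and sample sizes in the table, and a 95 percent confidence level (z = 1.96) is assumed.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Counts derived from the reported percentages (assumption).
results = [("GSM8K", 93, 100), ("MMLU", 53, 100), ("HumanEval (syntactic)", 45, 50)]
for name, k, n in results:
    low, high = wilson_interval(k, n)
    print(f"{name}: {k}/{n} -> {low:.1%} to {high:.1%}")
```

For the MMLU draw, the interval spans roughly 43 to 62 percent, which is why a 100-item sample is better read as a smoke test than as a precise score.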

## Availability and Practical Scope

The model is available at `huggingface.co/LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8` [1]. Reproducing the reported throughput numbers requires enough VRAM to hold the full quantized model. The reference configuration is two RTX PRO 6000 Blackwell Max-Q GPUs at 96 GB each, without NVLink [1]. The source notes that Max-Q-specific fixes were applied to the setup, but the details of those fixes were truncated in the available source material [1].

## FAQ

**Q. Why does the base pasta-paul quant not support MTP speculative decoding out of the box?**
Hugging Face Transformers lists the MTP block in `_keys_to_ignore_on_load_unexpected`, causing the weights to be silently discarded at load time rather than loaded into memory [1]. Any vLLM speculative config targeting the MTP method therefore has no head to speculate with.

**Q. Does restoring the MTP head increase time-to-first-token?**
Yes. At 524k context, TTFT rises from 91 ms in the no-MTP base to 155 ms in the patched model [1]. At 128k single-stream, TTFT is approximately 310 ms [1]. Operators should weigh this latency increase against the decode throughput gains for their specific workloads.

**Q. What quantization format does the MTP block use, and does it match the rest of the model?**
The retrofitted MTP routed-expert tensors use W4A16 INT4 group-128 symmetric quantization, matching the format of the base model's routed experts [1]. Attention projections use FP8 block format consistent with the upstream pasta-paul convention [1].

**Q. Can this model run on hardware without NVLink?**
Yes. The benchmarks were produced on two RTX PRO 6000 Max-Q GPUs with no NVLink connection [1]. The quantized model fits within the combined 192 GB of VRAM across the two cards [1].

**Q. Are the accuracy benchmarks sufficient to confirm production-quality output?**
The author describes the GSM8K, MMLU, and HumanEval runs as sanity checks on small samples (50 to 100 items), not full evaluations [1]. Full benchmark data is available in the model card, and operators requiring rigorous accuracy validation should consult that data and run their own evaluations.

## Key takeaways

- The MTP head in pasta-paul's DeepSeek-V4-Flash quant is silently dropped by Hugging Face Transformers at load time, making vLLM's MTP speculative decoding a no-op without intervention [1].
- Retrofitting the MTP block with a GPTQ W4A16 INT4 pass and patching vLLM raises decode throughput from 52.85 tok/s to 85.52 tok/s at 524k context, a 62 percent improvement [1].
- At 128k single-stream, the patched model reaches approximately 111 tok/s, more than doubling the no-MTP baseline [1].
- The 284B parameter model fits on two 96 GB RTX PRO 6000 Max-Q GPUs without NVLink, using a mixed W4A16 INT4 and FP8 quantization scheme [1].
- Small-sample accuracy checks show 93 percent on GSM8K and 90 percent on HumanEval syntactic evaluation, with MMLU tracking the base model [1].

## Sources

1. https://www.reddit.com/r/LocalLLaMA/comments/1t9em98/deepseekv4flash_w4a16fp8_with_mtp_selfspeculation/
