GHSA-83vm-p52w-f9pw: vLLM: extract_hidden_states speculative decoding crashes server on any request with penalty parameters
Summary
In vLLM versions 0.18.0 through 0.19.1, a bug in the `extract_hidden_states` speculative decoding proposer (a component that predicts tokens ahead of time to speed up inference) crashes the server whenever a request includes a sampling penalty parameter such as `repetition_penalty`. After the first decoding step the proposer returns a tensor (multi-dimensional array) of the wrong shape, and the sampler hits a shape mismatch error as soon as it applies penalties. Because these are standard, client-controllable parameters, any request can trigger the crash, making this a remotely triggerable denial of service.
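To make the failure mode concrete, here is a minimal, self-contained sketch. It is a simplified stand-in for the sampler's penalty bookkeeping, not vLLM's actual code; `mark_sampled` and `seen` are illustrative names. The update assumes exactly one sampled token per request, so the `[num_reqs, k]` tensor the buggy proposer returns after the first step triggers a shape mismatch:

```python
import torch

num_reqs, vocab_size = 2, 8
# Per-request bookkeeping the penalty path relies on: which token ids have
# already been produced for each request.
seen = torch.zeros(num_reqs, vocab_size, dtype=torch.bool)

def mark_sampled(seen: torch.Tensor, sampled_token_ids: torch.Tensor) -> None:
    # Assumes exactly one new token per request, i.e. shape [num_reqs, 1].
    rows = torch.arange(seen.shape[0])
    seen[rows, sampled_token_ids.squeeze(1)] = True

ok = torch.zeros(num_reqs, 1, dtype=torch.long)   # the shape the sampler expects
mark_sampled(seen, ok)                            # fine

bad = torch.zeros(num_reqs, 3, dtype=torch.long)  # [num_reqs, k] after the first step
mark_sampled(seen, bad)  # IndexError: shape mismatch; unhandled, the engine dies
```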
Solution / Mitigation
Fixed in vLLM v0.20.0 (PR #38610) by slicing the proposer's return value to `sampled_token_ids[:, :1]`, which restores the expected one-token-per-request shape (see the sketch below). If upgrading is not possible, either avoid `extract_hidden_states` as the speculative decoding method, or strip the penalty parameters (`repetition_penalty`, `frequency_penalty`, `presence_penalty`) from incoming requests at an API gateway before they reach vLLM; a gateway sketch follows the fix example.
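As a rough illustration of the fix (this mirrors the advisory's description, not the verbatim PR #38610 diff), slicing with `[:, :1]` trims the output to one token per request while preserving the two-dimensional shape that downstream code expects:

```python
import torch

# Hypothetical stand-in for the proposer's per-step output: the buggy code
# path handed this [num_reqs, k] tensor straight to the rest of the engine.
sampled_token_ids = torch.tensor([[11, 12, 13],
                                  [21, 22, 23]])

fixed = sampled_token_ids[:, :1]      # slice, keeping the tensor 2-D
print(fixed.shape)                    # torch.Size([2, 1])

# Indexing with [:, 0] instead would drop a dimension and yield shape [2],
# which is why the fix uses a slice rather than plain indexing.
print(sampled_token_ids[:, 0].shape)  # torch.Size([2])
```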
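For the gateway-side mitigation, a minimal sketch follows. `strip_penalty_params` is a hypothetical helper, not part of vLLM or any particular gateway; wire it into whatever proxy handles `POST /v1/completions` or `/v1/chat/completions` before the body is forwarded:

```python
# Penalty parameters that trigger the crash in affected vLLM versions.
PENALTY_PARAMS = frozenset(
    {"repetition_penalty", "frequency_penalty", "presence_penalty"}
)

def strip_penalty_params(payload: dict) -> dict:
    """Return a copy of an OpenAI-style request body without penalty params."""
    return {k: v for k, v in payload.items() if k not in PENALTY_PARAMS}

body = {"model": "m", "prompt": "hello", "repetition_penalty": 1.2}
print(strip_penalty_params(body))  # {'model': 'm', 'prompt': 'hello'}
```

Note that stripping these parameters silently changes sampling behavior for clients that rely on them, so this is a stopgap until the upgrade to v0.20.0.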
Vulnerability Details
EPSS: 0.0%
Classification
Affected Vendors
vLLM
Affected Packages
vllm (PyPI), versions 0.18.0 through 0.19.1
Related Issues
CVE-2022-29200: TensorFlow is an open source platform for machine learning. Prior to versions 2.9.0, 2.8.1, 2.7.2, and 2.6.4, the implementation…
CVE-2021-29541: TensorFlow is an end-to-end open source platform for machine learning. An attacker can trigger a dereference of a null pointer…
Original source: https://github.com/advisories/GHSA-83vm-p52w-f9pw
First tracked: May 6, 2026 at 08:00 PM
Classified by LLM (prompt v3) · confidence: 95%