Efficient LLM System with Speculative Decoding
Xiaoxuan Liu
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2025-224
December 19, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-224.pdf
Large language model (LLM) inference is increasingly deployed in latency-critical and cost-sensitive settings, yet remains fundamentally constrained by the sequential nature of autoregressive decoding. Speculative decoding has emerged as a promising technique to mitigate this bottleneck by introducing intra-request parallelism, allowing multiple tokens to be verified in parallel by a target model. However, despite its growing adoption, speculative decoding exhibits fragile and highly variable performance in real-world systems, with effectiveness depending sensitively on workload characteristics, batch sizes, model configurations, and system conditions.
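The draft-then-verify loop that provides this intra-request parallelism can be illustrated with a minimal sketch. The greedy variant below uses toy lookup-table stand-ins for the draft and target models (`draft_next` and `target_next` are illustrative, not components of any system described in this report): the cheap draft model proposes k tokens autoregressively, the target checks all k positions, and the longest agreeing prefix is accepted plus one token from the target.

```python
# Minimal sketch of speculative decoding's draft-then-verify loop
# (greedy variant; the two model functions are toy stand-ins).

def draft_next(prefix):
    # Toy draft model: next token from a fixed lookup table.
    table = {(): "the", ("the",): "cat", ("the", "cat"): "sat",
             ("the", "cat", "sat"): "down"}
    return table.get(tuple(prefix), "<eos>")

def target_next(prefix):
    # Toy target model: agrees with the draft except after "sat".
    table = {(): "the", ("the",): "cat", ("the", "cat"): "sat",
             ("the", "cat", "sat"): "on"}
    return table.get(tuple(prefix), "<eos>")

def speculative_step(prefix, k):
    """Draft k tokens, then verify them with the target.

    Returns the tokens actually accepted: the longest prefix of the
    draft the target agrees with, plus one target token (either the
    correction, or a bonus token if every draft was accepted).
    """
    # 1. Draft k tokens autoregressively with the cheap model.
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))
    # 2. Verify all k positions with the target model; a real engine
    #    does this in a single batched forward pass.
    accepted = []
    for i, tok in enumerate(draft):
        expected = target_next(prefix + draft[:i])
        if tok != expected:
            accepted.append(expected)  # target's correction
            return accepted
        accepted.append(tok)
    # All k drafts accepted; the target supplies one bonus token.
    accepted.append(target_next(prefix + draft))
    return accepted
```

With four drafted tokens, the target rejects the last one and substitutes its own, so a single verification pass still yields four tokens instead of one.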
This dissertation studies speculative decoding from a full-stack systems perspective, spanning algorithmic design, empirical characterization, and production-grade control mechanisms. First, we introduce Online Speculative Decoding (OSD), a framework that continuously adapts draft models to the evolving query distribution during serving. By leveraging knowledge distillation, OSD improves token acceptance rates on the fly and substantially reduces inference latency without increasing draft model size, demonstrating that speculative decoding performance can be dynamically optimized rather than statically configured.

Next, we introduce TurboSpec, a closed-loop control system for speculative decoding in LLM serving. TurboSpec formalizes goodput—the rate of successfully generated tokens—as a unifying system metric and uses offline profiling combined with online feedback to dynamically adjust speculative parameters at runtime. By adaptively balancing inter-request batching and intra-request speculation, TurboSpec robustly optimizes performance across diverse workloads, hardware platforms, and speculative decoding methods, eliminating the need for manual tuning and preventing performance regressions under high load or low acceptance regimes.
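The trade-off a goodput metric captures can be sketched with the standard acceptance model from the speculative decoding literature: with an i.i.d. per-token acceptance rate α and proposal length k, a verification step accepts (1 − α^(k+1)) / (1 − α) tokens in expectation, while step time grows with k. The toy latency model below is an assumption for illustration only, not TurboSpec's profiler; the point is that the goodput-maximizing k is finite and workload-dependent.

```python
# Illustrative goodput estimate: expected accepted tokens per second.
# The acceptance formula is the standard i.i.d. model; the step-time
# model is a made-up toy, not TurboSpec's profiled cost model.

def expected_accepted(alpha, k):
    """Expected tokens per verification step with proposal length k:
    one guaranteed target token plus geometrically accepted drafts,
    i.e. (1 - alpha**(k+1)) / (1 - alpha)."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def goodput(alpha, k, step_time):
    """Accepted tokens per second for one request stream."""
    return expected_accepted(alpha, k) / step_time(k)

# Toy latency model (assumed): verification cost grows mildly with k.
toy_step = lambda k: 0.030 + 0.002 * k  # seconds per step

# Sweep proposal lengths to find the goodput-maximizing k.
best_k = max(range(0, 9), key=lambda k: goodput(0.7, k, toy_step))
```

Under these assumed numbers the sweep peaks at an intermediate k: longer proposals keep adding expected tokens, but at a diminishing rate that eventually loses to the growing verification cost, which is exactly the regime a closed-loop controller must track as α and load shift at runtime.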
Lastly, we present the first systematic, production-grade evaluation of speculative decoding in a widely deployed inference engine. Through extensive benchmarking across speculative decoding variants, workloads, batch sizes, and model scales, we uncover that verification by the large target model remains the dominant cost, while acceptance behavior varies dramatically across token positions, requests, and datasets. We quantify a theoretical upper bound on speculative decoding speedup and show that existing methods fall far short of this limit. This analysis reframes speculative decoding not merely as a drafting problem, but as a verification efficiency problem, and identifies new opportunities for adaptive and selective verification.
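The shape of such a speedup ceiling can be illustrated with the standard analytical model from the speculative decoding literature (a hedged sketch; the bound derived in the report may differ): with per-token acceptance rate a, proposal length k, and draft cost c per token relative to one target forward pass, expected speedup over plain autoregressive decoding is roughly (1 − a^(k+1)) / ((1 − a)(ck + 1)).

```python
# Standard analytical speedup model for speculative decoding
# (illustrative; not necessarily the bound derived in this report).

def speedup(a, k, c):
    """Expected speedup over autoregressive decoding, given per-token
    acceptance rate a, proposal length k, and relative draft cost c."""
    return (1 - a ** (k + 1)) / ((1 - a) * (c * k + 1))

# Even a free draft model (c = 0) with a = 0.8 and k = 5 caps the
# speedup well below k + 1, since rejections truncate each step.
ceiling = speedup(0.8, 5, 0.0)
# Any nonzero draft cost pulls the achievable speedup lower still.
realistic = speedup(0.8, 5, 0.05)
```

Even in this idealized model the ceiling is far below the k + 1 tokens drafted per step, which is consistent with the abstract's observation that verification cost, not drafting, dominates the gap between measured and theoretical speedups.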
Together, these contributions establish a principled and practical foundation for speculative decoding in real-world LLM serving systems. This dissertation demonstrates that achieving reliable inference acceleration requires not only better draft models, but also adaptive learning, rigorous system-level analysis, and feedback-driven control—pointing toward a new generation of intelligent inference systems that continuously optimize themselves in production.
Advisor: Alvin Cheung
BibTeX citation:
@phdthesis{Liu:EECS-2025-224,
Author= {Liu, Xiaoxuan},
Title= {Efficient LLM System with Speculative Decoding},
School= {EECS Department, University of California, Berkeley},
Year= {2025},
Month= {Dec},
Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-224.html},
Number= {UCB/EECS-2025-224},
}
EndNote citation:
%0 Thesis
%A Liu, Xiaoxuan
%T Efficient LLM System with Speculative Decoding
%I EECS Department, University of California, Berkeley
%D 2025
%8 December 19
%@ UCB/EECS-2025-224
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-224.html
%F Liu:EECS-2025-224