2 min read · Machine Learning

Speculative Decoding Implementations: EAGLE-3, Medusa-1, PARD, Draft Models, N-gram and Suffix Decoding from scratch [P]

Our take

The Speculative Decoding Implementations repository is an educational resource for understanding speculative decoding methods built from scratch. It implements EAGLE-3, Medusa-1, PARD, draft models, n-gram, and suffix decoding under a shared decoding and evaluation contract, making it easier to study how proposer designs differ, including the distinction between proposer quality and verifier cost.

I’ve been working on an educational implementation repo for speculative decoding:

https://github.com/shreyansh26/Speculative-Decoding

The goal is not to wrap existing libraries, but to implement several speculative decoding methods from scratch behind a shared decoding/evaluation contract so that the differences between proposer designs are easier to study.
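To make the idea of a shared contract concrete, here is a minimal sketch of what such an interface could look like. The names (`Proposer`, `Proposal`, `decode_step`) are illustrative assumptions, not the repo's actual API:

```python
# Hypothetical sketch of a shared speculative-decoding contract.
# Every method plugs in as a Proposer; verification is a callable
# that keeps the accepted prefix of the draft.
from dataclasses import dataclass


@dataclass
class Proposal:
    draft_tokens: list[int]  # tokens guessed by the proposer


class Proposer:
    """Anything that can propose up to k draft tokens from the context."""

    def propose(self, context: list[int], k: int) -> Proposal:
        raise NotImplementedError


class EmptyProposer(Proposer):
    """Trivial baseline: proposes nothing, degenerating to plain decoding."""

    def propose(self, context: list[int], k: int) -> Proposal:
        return Proposal(draft_tokens=[])


def decode_step(proposer: Proposer, verify, context: list[int], k: int = 4) -> list[int]:
    """One speculative step: propose k tokens, append the verified prefix."""
    proposal = proposer.propose(context, k)
    accepted = verify(context, proposal.draft_tokens)
    return context + accepted
```

Because every method (learned head, draft model, or lookup-based) satisfies the same `propose` signature, the decoding loop and the evaluation harness can stay identical across methods, which is what makes proposer-level comparisons clean.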

Implemented methods so far:

  • EAGLE-3
  • Medusa-1
  • standard draft model speculation
  • PARD / parallel draft models
  • n-gram prompt lookup
  • suffix decoding
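The last two methods in the list are training-free. As a rough illustration of the n-gram prompt lookup idea (a simplified sketch, not the repo's implementation): find the most recent earlier occurrence of the last n tokens of the context and propose whatever followed it.

```python
# Simplified n-gram prompt lookup: a training-free proposer that mines
# draft tokens from the prompt/generated context itself.
def ngram_propose(context: list[int], n: int = 2, k: int = 4) -> list[int]:
    """Propose up to k tokens that followed the most recent earlier
    occurrence of the last n context tokens; empty if no match."""
    if len(context) < n + 1:
        return []
    key = tuple(context[-n:])
    # scan right-to-left so the most recent earlier match wins
    for i in range(len(context) - n - 1, -1, -1):
        if tuple(context[i:i + n]) == key:
            return context[i + n:i + n + k]
    return []
```

This is why such methods shine when the prompt contains reusable structure (code, tables, repeated boilerplate): the lookup hits long exact continuations at essentially zero proposer cost.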

The repo has both training and inference paths where applicable. For learned proposers, I use Qwen/Qwen2.5-7B-Instruct as the target model and small learned/speculative heads or draft models, depending on the method. For training-free methods, the proposer is built from the prompt/generated context.

A few things I wanted the repo to make explicit:

  1. The distinction between proposer quality and verifier cost.
  2. Why a high acceptance rate does not always imply higher throughput.
  3. Why methods like PARD can be faster despite lower acceptance than an autoregressive draft model.
  4. How EAGLE/Medusa-style learned heads differ from draft-model speculation.
  5. How simple methods like n-gram and suffix decoding behave when the prompt contains a reusable structure.
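Points 2 and 3 come down to simple arithmetic. Under the common simplifying assumption that each draft token is accepted independently with probability a (the model from the original speculative decoding analysis), the expected tokens per verification step and the resulting speedup can be sketched as:

```python
# Back-of-the-envelope speedup model for speculative decoding.
# Assumes i.i.d. per-token acceptance probability a and draft length k.
def expected_tokens(a: float, k: int) -> float:
    """Expected tokens emitted per target verification step: up to k
    accepted drafts plus one token from the target's own forward pass."""
    if a == 1.0:
        return k + 1
    return (1 - a ** (k + 1)) / (1 - a)


def speedup(a: float, k: int, c: float) -> float:
    """Speedup vs. plain autoregressive decoding, where c is the cost of
    producing one draft token relative to one target forward pass."""
    return expected_tokens(a, k) / (k * c + 1)
```

For example, a parallel drafter with lower acceptance but a much cheaper per-token draft cost (the PARD-style regime, where draft tokens come from one parallel pass rather than k sequential ones) can still beat a higher-acceptance autoregressive draft model: `speedup(0.6, 4, 0.02)` exceeds `speedup(0.8, 4, 0.2)` in this model. Acceptance rate alone never determines throughput.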

The repo includes benchmark summaries, command lines, checkpoints/exports, and implementation notes. Some results are intentionally on small train-overlap eval slices due to compute constraints, so I would treat the numbers as implementation/behavioral benchmarks rather than broad generalization claims.

I built this mostly as a learning resource for people who want to understand speculative decoding at the algorithm + systems boundary: how the proposer is trained, how draft tokens are generated, how target verification works, what gets cached, and where the speedups actually come from.
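For the verification piece of that boundary, here is a hedged sketch of greedy verification, assuming the target has already scored the context plus all draft tokens in one forward pass (`target_argmax` stands in for those argmax outputs; sampling-based verification with acceptance/rejection is more involved):

```python
# Greedy verification sketch: accept the longest prefix of the draft
# that matches the target's argmax at each position, then emit the
# target's own prediction at the first mismatch (the "bonus" token).
def verify_greedy(draft: list[int], target_argmax: list[int]) -> list[int]:
    """target_argmax[i] is the target's top token after seeing the
    context plus draft[:i]; it has len(draft) + 1 entries."""
    accepted: list[int] = []
    for d, t in zip(draft, target_argmax):
        if d != t:
            break
        accepted.append(d)
    # the target's next token at the mismatch point comes for free,
    # so every verification step emits at least one token
    accepted.append(target_argmax[len(accepted)])
    return accepted
```

The key systems point is that all len(draft) + 1 target logits come from a single batched forward pass, so even a fully rejected draft costs roughly one ordinary decoding step.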

submitted by /u/shreyansh26


Tagged with

#Speculative Decoding #EAGLE-3 #Medusa-1 #PARD #Draft Models #Proposer #N-gram #Suffix Decoding #Verifier #Acceptance Rate #Benchmark Summaries #Throughput #Implementation Notes #Learning Resource #Instruct Model