Daily Tech Feed: From the Labs

Deep dives into foundational AI and ML research papers

23: Saguaro: The Algorithm That Doesn't Wait

Speculative decoding already beats autoregressive generation, but it still has a sequential bottleneck: verification must finish before drafting restarts. Saguaro (Speculative Speculative Decoding) breaks that dependency by pre-speculating on likely verification outcomes while the verifier is still running.

Show Notes

Episode 0023: Making the Wait Do Work

Why it matters. Links to arXiv:2603.03251. Explains Saguaro / SSD — the second speculation layer that keeps the draft model productive during verifier execution. Roughly 2× faster than optimized speculative decoding, 5× faster than plain autoregressive generation, and lossless (output matches the target model).
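A back-of-the-envelope timing model shows why hiding draft work under verification pays off. This is an illustrative sketch with made-up costs, not the paper's measurements, and it assumes the best case where every pre-speculation is usable:

```python
# Toy wall-clock model: classic speculative decoding (SD) vs. an
# SSD-style pipeline that overlaps drafting with verification.
# t_draft / t_verify are per-round costs; values are illustrative.

def sequential_sd_time(rounds: int, t_draft: float, t_verify: float) -> float:
    # Classic SD: each round drafts, then waits for verification to finish.
    return rounds * (t_draft + t_verify)

def overlapped_ssd_time(rounds: int, t_draft: float, t_verify: float) -> float:
    # SSD-style best case: after the first draft, drafting for the next
    # round hides entirely under the current verification
    # (assumes t_draft <= t_verify and every pre-speculation is a hit).
    return t_draft + rounds * t_verify

seq = sequential_sd_time(10, 1.0, 3.0)   # 10 * (1 + 3) = 40.0
ovl = overlapped_ssd_time(10, 1.0, 3.0)  # 1 + 10 * 3   = 31.0
print(seq, ovl, round(seq / ovl, 2))
```

In this toy setup the pipeline removes the draft cost from the critical path entirely; the real speedup depends on how often pre-speculation is wasted, which is what the cache hit/miss fallback below the fold addresses.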

Stanford / Together AI. Links to the paper, GitHub (tanishqkumar/ssd), and both model pages (Llama 3.1 70B target, Llama 3.2 1B draft).

The Researchers. Three authors with confirmed Google Scholar IDs:
- Tanishq Kumar — Stanford CS PhD
- Tri Dao — Princeton / Together AI, FlashAttention
- Avner May — Staff Research Scientist, Together AI

Key Technical Concepts. Links to: original speculative decoding (arXiv:2211.17192), speculative sampling (arXiv:2302.01318), FlashAttention (arXiv:2205.14135), FlashAttention-2 (arXiv:2307.08691), Llama 3.1 paper. Covers the three core challenges, the 90% bonus token prediction result, cache hit/miss fallback, and the CPU branch prediction analogy from the paper.
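The cache hit/miss fallback mentioned above can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual API: `draft` is a stand-in draft model, and the cache keys pre-drafted continuations by the verified prefix they assumed (here, the common "all proposed tokens accepted" outcome):

```python
# Toy sketch of SSD's cache hit/miss fallback. All names are
# illustrative; the real system runs these steps concurrently.

def draft(prefix, k=4):
    """Stand-in draft model: deterministic pseudo-tokens from the prefix."""
    return [(sum(prefix) * 31 + i) % 97 for i in range(k)]

def ssd_step(verified_prefix, cache):
    """After the verifier returns, reuse a pre-drafted continuation on a
    cache hit; otherwise fall back to drafting now (standard SD)."""
    key = tuple(verified_prefix)
    if key in cache:
        return cache[key], "hit"            # draft work already done
    return draft(verified_prefix), "miss"   # standard-SD fallback

prefix = [1, 2, 3]
proposed = draft(prefix)
# While the verifier runs, pre-speculate for the most likely outcome:
# full acceptance of all proposed tokens.
cache = {tuple(prefix + proposed): draft(prefix + proposed)}

# Verifier accepts everything -> the pre-drafted tokens are free.
tokens, status = ssd_step(prefix + proposed, cache)
# Verifier rejects the last token -> miss, draft fresh as in plain SD.
tokens2, status2 = ssd_step(prefix + proposed[:-1], cache)
print(status, status2)  # hit miss
```

The miss path is what makes the scheme lossless: a wrong pre-speculation wastes some draft compute but never changes the output, which mirrors how a CPU discards work after a branch mispredict.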

~20 verified links total. All arXiv IDs pulled from search results, no fabricated URLs.