Daily Tech Feed: From the Labs

Deep dives into foundational AI and ML research papers

33: Stop Thinking So Hard

Show Notes

Stop Thinking So Hard

Large reasoning models have an overthinking problem. They reach the correct answer early in their chain of thought — then keep generating thousands of additional tokens reconsidering, double-checking, and exploring alternatives they'll ultimately discard. A new paper from researchers at UT Austin, EPFL, ENS Paris-Saclay, and Telecom Paris introduces TERMINATOR, an inference-time early-exit strategy that detects when a model has already generated its final answer and stops reasoning immediately.

The key insight is that the first arrival of a model's final answer in its chain of thought is detectable from hidden states. Token confidence spikes distinctly at the answer position. Thinking-word usage shifts — words like "hmm" and "okay" cluster before the answer; words like "another" and "alternatively" cluster after. These signals are real, consistent across math, coding, and science domains, and learnable by a small classifier.
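The confidence-spike signal is easy to picture in code. Here is a minimal, hypothetical sketch (not the paper's implementation): given per-token logits from a decoding run, per-token confidence is the softmax probability the model assigns to its own greedy choice, and the answer position shows up as a sharp peak in that series. The toy logits below are invented to mimic that spike.

```python
import numpy as np

def token_confidence(logits: np.ndarray) -> np.ndarray:
    """Per-token confidence: the probability the model assigns to its
    own greedy choice at each position (max of the softmax row)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)

# Toy trace: near-uniform logits everywhere except a sharp peak at
# position 3, mimicking the confidence spike at the answer token.
rng = np.random.default_rng(0)
logits = rng.normal(0.0, 0.1, size=(6, 50))  # 6 positions, 50-token vocab
logits[3, 7] += 8.0  # the "answer" token is near-certain
conf = token_confidence(logits)
print(conf.argmax())  # index of the confidence spike
```

In a real deployment this series would come from the model's hidden states during generation; the point here is only that the spike is a simple, position-local statistic a small classifier can learn to detect.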

TERMINATOR is a single transformer layer — initialized from the base model's final layer — with a binary prediction head trained to predict answer arrival at every token position. At inference time, a sliding window over the ten most recent predictions triggers a stop when a majority vote says the answer is already there, injecting a close-thinking token into the token stream. No dataset-calibrated thresholds, no samples from the test-time distribution. Train once, deploy anywhere.
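The stopping rule itself is just a majority vote over a short window. A minimal sketch, assuming the classifier emits one binary answer-arrival prediction per generated token (the simulated `stream` below stands in for those predictions):

```python
WINDOW = 10  # the paper votes over the ten most recent predictions

def should_stop(predictions: list[int], window: int = WINDOW) -> bool:
    """Majority vote over the last `window` binary answer-arrival
    predictions; True means stop reasoning and inject the
    close-thinking token."""
    recent = predictions[-window:]
    return len(recent) == window and sum(recent) > window // 2

# Simulated per-token classifier outputs during one generation:
# noisy isolated positives early, then sustained positives once the
# final answer has actually appeared in the chain of thought.
stream = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0] + [1] * 10
history, stop_at = [], None
for i, p in enumerate(stream):
    history.append(p)
    if should_stop(history):
        stop_at = i  # position where generation would be cut off
        break
print(stop_at)
```

The window is what makes the rule robust: a lone spurious positive never clears the majority threshold, so the exit fires only once the classifier's predictions stay positive for several consecutive tokens.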

Results

Tested on Qwen3-8B, Qwen3-14B, Ministral-3-8B-Reasoning, and Ministral-3-14B-Reasoning across MATH-500, AIME 2025, HumanEval, and GPQA:

  • Best or second-best on 28 of 32 metrics (accuracy and compression rate)
  • MATH-500: ~45% token reduction with an accuracy drop under 0.5 percentage points
  • AIME 2025: ~30% reduction; TERMINATOR exits too early on hard problems — a documented failure mode
  • Consistently sits at the best position on the accuracy-efficiency Pareto frontier versus DEER, Dynasor, Thought Calibration, and NoThinking

Links

Related Work Mentioned

  • DEER — chunk-based early exit via token probability thresholds
  • Dynasor — periodic intermediate answer consistency checks
  • Thought Calibration — linear probes on reasoning step hidden states
  • Self-Certainty / Kang et al. — KL divergence confidence metric for reasoning
  • DeepSeek-R1 — large reasoning model showing overthinking phenomenon
  • Qwen3 — base models used in experiments
  • vLLM — inference framework used for dataset curation

Datasets

  • MATH-500 — Lightman et al., 500-problem evaluation subset of the MATH benchmark for mathematical problem solving
  • AIME 2025 — American Invitational Mathematics Examination
  • HumanEval — Chen et al., Python code generation
  • GPQA — Rein et al., graduate-level science questions
  • OpenScience — NVIDIA, scientific research dataset
  • OpenCoder-SFT — Huang et al., code instruction fine-tuning

DTF:FTL is produced by PDX Hackerspace Foundation. Find us on Apple Podcasts, Spotify, or wherever fine podcasts are distributed.