Claude Opus 4.8 thinking capability degrades on hard prompts — Intelligence Feed

TL;DR: Claude Opus 4.8 Thinking ranks below both Opus 4.6 Thinking and Opus 4.7 Thinking on the LMArena Hard Prompts (English) leaderboard, which may indicate a performance regression.

Summary: Recent data from the LMArena Hard Prompts (English) leaderboard shows Opus 4.6 Thinking in the #1 spot, with Opus 4.7 Thinking 15 points behind and Opus 4.8 Thinking 23 points behind. Users report that the drop appears linked to bugs in Anthropic's inference infrastructure during the early release phase, which degraded output quality. Meanwhile, subjective reports still favor Opus 4.8 over competitors like GPT-5 for frontend design and aesthetic tasks.

Why it matters: AI builders relying on Opus for complex, non-coding reasoning tasks may want to pin their workflows to version 4.6 or 4.7. Watch Anthropic's infrastructure updates to see when the reported inference issues are resolved.

Source: r/singularity

Original post:

Opus 4.6 Thinking keeps the #1 spot. Followed by Opus 4.7 Thinking (-15 points). Lastly, Opus 4.8 Thinking (-23 points compared to 4.6 Thinking). https://arena.ai/leaderboard/text/hard-prompts-english
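
To put the point gaps in perspective, here is a minimal sketch that converts a score difference into an expected head-to-head win rate. It assumes the leaderboard scores follow the standard Elo-style logistic convention (base 10, scale 400); the post does not state this, and the function name `expected_win_rate` and the `GAPS` table are illustrative only.

```python
def expected_win_rate(score_gap: float, scale: float = 400.0) -> float:
    """Expected win rate of the higher-rated model.

    Assumes an Elo-style logistic curve (base 10, scale 400).
    This is an assumption about the leaderboard's scoring, not
    something stated in the post.
    """
    return 1.0 / (1.0 + 10.0 ** (-score_gap / scale))


# Point gaps quoted in the post, relative to Opus 4.6 Thinking.
GAPS = {"Opus 4.7 Thinking": 15, "Opus 4.8 Thinking": 23}

if __name__ == "__main__":
    for model, gap in GAPS.items():
        rate = expected_win_rate(gap)
        print(f"Opus 4.6 Thinking vs {model}: {rate:.1%} expected win rate")
```

Under that assumption, a 23-point gap corresponds to roughly a 53% expected win rate for Opus 4.6 Thinking in a direct comparison with Opus 4.8 Thinking, and a 15-point gap to roughly 52% against Opus 4.7 Thinking.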