FrontierMath benchmark undergoes major audit as Epoch AI flags errors in one-third of math problems

Epoch AI’s FrontierMath benchmark, a 350-problem test designed to push AI systems to their mathematical limits, is undergoing a significant correction after an internal review flagged errors in roughly one-third of its dataset. The audit, disclosed on May 11, 2026, revealed that the problems designed to stump the world’s most advanced AI models had a quality control issue of their own.

The organization plans to release updated scores once a thorough human review is completed.

What FrontierMath actually is, and why it matters

FrontierMath launched in November 2024 and was developed in collaboration with more than 60 mathematicians. The core dataset includes 300 problems across Tiers 1 through 3, spanning undergraduate to advanced graduate difficulty. Tier 4 adds another 50 research-level problems, the kind of questions that even professional mathematicians might need hours or days to solve.

Earlier reviews of the dataset had suggested error rates in the range of 7% to 10%, based on limited secondary checks. The AI-assisted review that Epoch AI conducted painted a much less flattering picture: it estimated that approximately 33% of problems contain what the organization described as fatal errors.

The audit process and what went wrong

The errors flagged weren’t typos or formatting issues. They were described as fatal, meaning the problems themselves were fundamentally flawed in ways that would make correct answers impossible or ambiguous.

Epoch AI has committed to completing a full human review of every flagged problem before releasing corrected scores. Any model scores previously reported against FrontierMath should be taken with a generous grain of salt until the corrected version drops.

Why AI benchmarks should be on every crypto investor’s radar

FrontierMath has no connection to crypto, blockchain, or tokens. It lives squarely in the domain of pure mathematics and AI evaluation.

Updated scores from the cleaned dataset will likely shift the perceived capability frontier for leading models, potentially in either direction. As of June 12, 2026, no confirmation has been provided regarding a version 2 of the dataset.
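
To see why a cleaned dataset could move a score in either direction, consider a rough back-of-the-envelope sketch. Everything here is an illustrative assumption rather than Epoch AI's method: the model's 70 credited answers are hypothetical, the flawed problems are simply dropped (the organization has not said how rescoring will work), and the function name `rescored_accuracy` is made up for this example.

```python
def rescored_accuracy(total, solved, flawed, solved_among_flawed):
    """Accuracy after dropping flawed problems (a simplifying assumption).

    total: problems in the benchmark
    solved: problems the model was originally credited with
    flawed: problems removed as fatally flawed
    solved_among_flawed: credited answers that fell on removed problems
    """
    if not 0 <= solved_among_flawed <= min(solved, flawed):
        raise ValueError("solved_among_flawed is out of range")
    valid = total - flawed
    remaining_solved = solved - solved_among_flawed
    if remaining_solved > valid:
        raise ValueError("more solved problems than valid problems")
    return remaining_solved / valid


TOTAL = 350                 # benchmark size cited in the article
FLAWED = round(TOTAL / 3)   # roughly one-third flagged
SOLVED = 70                 # hypothetical model: 20% on the original set

original = SOLVED / TOTAL

# Case A: none of the credited answers were on flawed problems.
score_up = rescored_accuracy(TOTAL, SOLVED, FLAWED, 0)

# Case B: 40 of the credited answers were on flawed problems.
score_down = rescored_accuracy(TOTAL, SOLVED, FLAWED, 40)

print(f"original: {original:.1%}")
print(f"case A:   {score_up:.1%}")
print(f"case B:   {score_down:.1%}")
```

In the first case the score rises because the denominator shrinks while the credited answers stay put; in the second it falls because many of those credited answers disappear along with the flawed problems. Which case applies to a given model depends on where its credited answers sat, something only the completed human review can settle.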

Disclosure: This article was edited by Editorial Team. For more information on how we create and review content, see our Editorial Policy.
