AI · The Decoder · 1h ago

UK's AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actually do

In a study covering seven benchmarks, the UK's AI Security Institute shows that standard AI evaluations systematically underestimate agent capabilities by capping the compute budget. On software engineering tasks, success rates jumped about 25 percent when the token budget was…

Source: The Decoder