Researchers may have found a way to stop AI models from intentionally playing dumb during safety evaluations

AI · The Decoder · 3h ago

A study by researchers from the MATS program, Redwood Research, the University of Oxford, and Anthropic examines a safety problem that grows more pressing as AI systems become more capable: "sandbagging," where a model deliberately hides its true abilities and delivers work that…

Source: The Decoder
