The mitigating factor for the problems with AI benchmarks (errors, saturation, contamination) is that, despite these issues, they are all still fairly heavily correlated. So if your AI does well on GPQA or MMLU or HLE, it also tends to do well on other benchmarks & on vibes & real work.