LLMs are typically evaluated w/ automatic metrics on standard test sets, but metrics + test sets are developed independently. This raises a crucial question: Can we design automatic metrics specifically to excel on the test sets we prioritize? Answer: Yes!
11,37K