🧠 Grok 4 by @xai is making strides in reasoning benchmarks, but the picture is more nuanced than the scores suggest. Here's how it stacks up, and what we can really learn from its results 🧵

📊 Full eval:

1️⃣ Grok 4 scores:
• AI2 Reasoning Challenge (Easy): 98%
• AIME 2025 (Math): 89%
• Accounting Audit: 84%
• MMLU-Plus: 64%
• Data4Health: 55%

These are top-line scores, but let's zoom in on what's working and what still fails.

2️⃣ AIME 2025
✅ Handles algebra, geometry, number theory
✅ Follows LaTeX formatting rules
❌ Struggles with multi-step logic
❌ Errors in combinatorics
❌ Format precision issues (e.g. a missing °)

3️⃣ Accounting Audit
✅ Strong on ethics & reporting
✅ Solid grasp of auditing principles
❌ Misinterprets similar procedures
❌ Fails to spot subtle answer differences
❌ Struggles to apply theory to real-world cases

4️⃣ The real insight? Even a model scoring 98% on some tasks can fail hard under ambiguity or formatting stress. Benchmarks like AIME and the Audit set show how it fails, not just how much it scores.

5️⃣ Why this matters: we need transparent, per-task evaluation, not just leaderboards (sketch of what that reporting can look like below). #Grok4 is powerful, but still brittle in high-stakes, real-world domains.

🧪 Explore the full breakdown:

#AI #LLMs #Benchmarking
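➕ Bonus for anyone running their own evals: a minimal sketch of per-task reporting instead of a single leaderboard number. Task names and results here are made up for illustration, not from the actual eval.

```python
# Minimal sketch: per-task accuracy breakdown.
# The task names and pass/fail results are hypothetical placeholders.
from collections import defaultdict

results = [
    {"task": "AIME 2025", "correct": True},
    {"task": "AIME 2025", "correct": False},
    {"task": "Accounting Audit", "correct": True},
    {"task": "Accounting Audit", "correct": True},
]

totals, hits = defaultdict(int), defaultdict(int)
for r in results:
    totals[r["task"]] += 1          # count attempts per task
    hits[r["task"]] += r["correct"] # count correct answers per task

for task in totals:
    print(f"{task}: {hits[task] / totals[task]:.0%} ({hits[task]}/{totals[task]})")
```

Reporting each task's numerator and denominator alongside the percentage is what makes failure modes like the ones above visible.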