This is exactly why human-in-the-loop pipelines are necessary for the foreseeable future: as task complexity and horizon go up, success rates drop significantly, so you need humans to consistently ground the process to keep success rates up. The main issues with HITL approaches are: 1. fine-tuned LLMs have gotten good enough that it's hard for humans to assess whether the outputs actually meet objective requirements, because a lot of work has gone into making them 'appear good'; 2. knowing when a human should intervene, or when the agent/model should hand off the task/evaluation. Hallucination detection is one hell of a topic
Benjamin Todd · 16.6.2025
Why can AIs code for 1h but not 10h? A simple explanation: if there's a 10% chance of error per 10-minute step (say), the success rate is: 1h: 53%, 4h: 8%, 10h: 0.2%. @tobyordoxford has tested this 'constant error rate' theory and shown it's a good fit for the data: the chance of success declines exponentially.
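The arithmetic behind the tweet is just a geometric decay: if each step succeeds independently with probability 1 − p, a run of n steps succeeds with probability (1 − p)^n. A minimal sketch, using the thread's numbers (10% error per 10-minute step; the function name is my own):

```python
# Constant error-rate model: P(success over n steps) = (1 - p) ** n,
# with p = 0.10 per 10-minute step, as in the tweet above.
def success_rate(hours, p_error=0.10, step_minutes=10):
    steps = hours * 60 // step_minutes
    return (1 - p_error) ** steps

for h in (1, 4, 10):
    print(f"{h}h: {success_rate(h):.2%}")
# 1h:  0.9**6  ≈ 53%
# 4h:  0.9**24 ≈ 8%
# 10h: 0.9**60 ≈ 0.2%
```

Note the exponential shape: doubling the horizon squares the failure odds, which is why a model that reliably codes for an hour still almost never survives ten.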