Frontier Thinking

90% of Human Expertise Is Not Verifiable

The AI training data market has a theory of how things work. Source more experts, recruit from bigger pools, throw bodies at the problem. For years, that theory held up. When the bar was low (read a prompt, write a response, rank some preferences), scale was the game. The bar has moved.

We’ve spent three years building human data infrastructure for frontier labs. Across every domain we've worked in (healthcare, legal, accounting, software engineering, data science), roughly 90% of expert work depends on human judgment against subjective criteria. Even in the most verification-friendly domains, math and code, only about 20% of an expert's routine workflow is deterministically checkable. These are approximate figures, averaged across our initiatives, but the pattern is consistent: most of what professionals actually do sits outside the reach of RLVR.

The field is debating whether RLVR produces genuine reasoning or just efficient search compression. Either way, the quality of the reward signal determines if it works. And across most professional domains, that signal ranges from weak to nonexistent.

Many RLVR verifiers collapse to a single binary signal. Pass or fail. A trace that gets the right answer through sloppy reasoning gets fully reinforced. A trace that fails but contains strong intermediate work gets fully penalized. These are mirror problems, and both degrade the training signal in ways that compound over time. Building verification granular enough to reward good behavior in failed traces and penalize poor behavior in successful ones, without just rewarding memorization of the rubric, is an unsolved problem. And the workarounds are systematically degrading the training data.
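To make the contrast concrete, here is a minimal sketch of the two reward shapes described above. This is illustrative only, not any lab's actual reward code: the function names, the trace structure, and the rubric criteria are all invented for this example.

```python
def binary_reward(trace: dict) -> float:
    """Collapses everything to pass/fail on the final answer."""
    return 1.0 if trace["final_answer_correct"] else 0.0

def rubric_reward(trace: dict, weights: dict) -> float:
    """Partial credit over intermediate criteria.

    The unsolved part is designing criteria that reward genuine
    reasoning rather than memorization of the rubric itself.
    """
    score = sum(weights[c] for c in trace["criteria_met"] if c in weights)
    return score / sum(weights.values())

# Hypothetical rubric: integer weights over invented criteria.
weights = {"setup_correct": 2, "method_valid": 3,
           "steps_sound": 2, "final_answer": 3}

# A trace that fails the final answer but shows strong intermediate work.
failed_but_sound = {
    "final_answer_correct": False,
    "criteria_met": ["setup_correct", "method_valid", "steps_sound"],
}

print(binary_reward(failed_but_sound))           # 0.0: fully penalized
print(rubric_reward(failed_but_sound, weights))  # 0.7: partial credit
```

The binary verifier throws away everything the trace got right; the rubric version recovers some of that signal, at the cost of now having to trust (and defend) the rubric itself.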

How verification requirements corrupt the tasks they're supposed to measure

Models are very good at appearing capable when they're retrieving patterns rather than generalizing. They fail in ways that make it hard to assess whether the task was unclear or whether there's a real capability gap. The industry's response has been to over-specify. To get a task approved, it needs to be gradeable against rigid rubric criteria, so prompts get artificially restructured with strict formatting until they can be deterministically scored.

This collapses the task into something it wasn't designed to be. Over-specification removes the decision points where expert judgment actually matters, and it eliminates unconventional approaches that happen to be correct. We've watched dozens of projects where what we called "expert reasoning" tasks were actually instruction-following. The requirements placed on questions to make them verifiable were distorting the very training signal those questions were supposed to produce.

The verifier you can't evaluate

There's a second failure mode. When a task is too difficult, the model never completes it successfully. You're left with zero signal. Was the verifier broken? Too strict? Was the task genuinely beyond the model's current capability? Sometimes the failure is mundane, a folder named incorrectly for retrieval. Sometimes it's a real gap. The tooling doesn't help you tell the difference. You can't evaluate the verifier itself when the model produces zero successful completions.

This means the hardest tasks, the ones most likely to push model capabilities forward, are the ones where you have no feedback loop at all.

The numbers behind the waste

Even when tasks are well-designed, most need iteration. In a recent RL task-collection project in data science involving multi-tool use, roughly 70% of tasks needed adjustment to land in the difficulty window labs require, often 1 to 7 successful completions out of 10 model runs. Too easy, too hard. Barely anything lands right the first time.
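That filtering step can be sketched in a few lines. The 1-to-7-of-10 window is the range mentioned above; the function name and the sample results are invented for illustration.

```python
def classify_difficulty(successes: int, window: tuple = (1, 7)) -> str:
    """Place a task relative to the difficulty window, given its
    successful completions out of 10 model runs."""
    lo, hi = window
    if successes < lo:
        return "too_hard"   # model (almost) never solves it: no reward signal
    if successes > hi:
        return "too_easy"   # model already solves it: nothing to learn
    return "in_window"

# Hypothetical 10-run results for ten candidate tasks.
results = [0, 2, 5, 7, 10, 0, 9, 10, 0, 8]
labels = [classify_difficulty(s) for s in results]
print(f"{labels.count('in_window')} of {len(results)} usable as-is")
# 3 of 10 land in the window; the other 7 need adjustment
```

The arithmetic is trivial; the expensive part is everything that happens before it: building the task, running the model ten times, and deciding whether an out-of-window result means the task, the verifier, or the model is at fault.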

A single expert might spend 30 to 40 hours developing one task. Building the scenario, designing the verifier, iterating with models, waiting days for review feedback. The whole pipeline can take four weeks to produce one usable RL task. The industry treats this as an execution problem. Verification is what's actually broken.

What needs to happen

We need verification methodologies that make hard-to-verify tasks verifiable. Not by over-specifying them into instruction-following exercises, but by developing reward signals that can evaluate expert judgment without destroying it.

These methods compound. A verification approach for drug interaction analysis combines with one for clinical trial design to reach pharmacovigilance tasks that neither could access alone. The 90% of expert work sitting outside RLVR today becomes trainable as each new method opens adjacent domains.

Training data for deterministic tasks will be commoditized soon. The companies still competing on math and code are fighting over the wrong market. The future belongs to whoever figures out verification for the long tail of human intelligence, the judgment-dependent, nondeterministic work across professional domains.