I’ve been around long enough to watch the AI reproducibility crisis unfold in slow motion. Everyone talks about it, but nobody wants to pay for the fix. Google Research just dropped a paper that actually tries to quantify the problem, and the results are more interesting than I expected.
The core issue is simple: ground truth in ML evaluation usually comes from humans. Humans disagree. A lot. But most benchmarks pretend this disagreement doesn’t exist by collapsing multiple ratings into a single “plurality” label. Two examples can have the same plurality label while being radically different in how clear-cut the human consensus actually is. That’s a problem.
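To see what the plurality label throws away, here is a minimal sketch. The ratings are made up for illustration, and `plurality_and_agreement` is a hypothetical helper name, not something from the paper.

```python
from collections import Counter

def plurality_and_agreement(ratings):
    """Return the most common label and the share of raters who chose it."""
    counts = Counter(ratings)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(ratings)

# Hypothetical ratings from five raters on two different comments.
clear_cut = ["toxic", "toxic", "toxic", "toxic", "toxic"]
contested = ["toxic", "toxic", "toxic", "not_toxic", "not_toxic"]

print(plurality_and_agreement(clear_cut))   # ('toxic', 1.0)
print(plurality_and_agreement(contested))   # ('toxic', 0.6)
```

Both items end up labeled "toxic" in the benchmark, but one is unanimous and the other is a 3-to-2 split that could easily flip with a different set of raters.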
Google’s paper, “Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation,” asks a deceptively simple question: given a fixed budget, should you rate more items with fewer raters per item (breadth, the forest), or fewer items with more raters per item (depth, the tree)?
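The arithmetic behind that question is easy to sketch. I'm assuming here that cost is simply N times K ratings, with every rating priced the same; that is my simplification, not necessarily the paper's cost model. The function name `allocations` and the budget of 10,000 ratings are hypothetical.

```python
def allocations(budget, rater_options):
    """List (N, K) pairs that spend at most `budget` total ratings."""
    return [(budget // k, k) for k in rater_options if k <= budget]

# Hypothetical budget: 10,000 individual ratings in total.
for n_items, k_raters in allocations(10_000, [1, 3, 5, 10, 25, 50]):
    print(f"K={k_raters:>2} raters/item -> N={n_items:>5} items")
```

Every extra rater per item is paid for in items: going from K=1 to K=50 on the same budget shrinks the dataset from 10,000 items to 200.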
Historically, the field has gone with breadth. Most researchers use 1 to 5 raters per item and assume that’s sufficient to find the “correct” truth. Google’s simulator, built on real-world datasets involving subjective tasks like toxicity and hate speech detection, suggests this standard is often woefully inadequate.
They ran a massive stress test, varying N (total items, from 100 to 50,000) and K (raters per item, from 1 to 500). The goal was to find configurations whose results reach statistical significance (p < 0.05) reliably, meaning another team running the same experiment would likely get the same answer.
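To get a feel for this kind of stress test, here is a toy version. To be clear, this is not Google's simulator: the item mix, the two models' accuracies (85% and 80%), and the "does A beat B" criterion are all my assumptions, and I use a simple win rate across repeated runs instead of the paper's p-value test.

```python
import random

def run_experiment(n_items, k_raters, rng):
    """Simulate one evaluation; return True if model A outscores model B."""
    score_a = score_b = 0
    for _ in range(n_items):
        # Assumed: each item has its own chance that a rater says "positive".
        p_positive = rng.choice([0.1, 0.35, 0.65, 0.9])
        truth = p_positive > 0.5  # label a very large rater pool would settle on
        votes = sum(rng.random() < p_positive for _ in range(k_raters))
        gold = votes * 2 > k_raters  # plurality of K raters (ties count as negative)
        # Assumed: A matches the true label 85% of the time, B 80%.
        pred_a = truth if rng.random() < 0.85 else not truth
        pred_b = truth if rng.random() < 0.80 else not truth
        score_a += pred_a == gold
        score_b += pred_b == gold
    return score_a > score_b

def replication_rate(n_items, k_raters, trials=100, seed=0):
    """Share of simulated replications that conclude A beats B."""
    rng = random.Random(seed)
    wins = sum(run_experiment(n_items, k_raters, rng) for _ in range(trials))
    return wins / trials

# Hypothetical fixed budget of 3,000 ratings, split in different ways.
budget = 3_000
for k in (1, 3, 5, 15):
    n = budget // k
    print(f"N={n:>4}, K={k:>2}: A wins in {replication_rate(n, k):.0%} of runs")
```

A win rate near 100% means independent teams would almost always reach the same conclusion; a rate near 50% means the result is a coin flip. Which split comes out ahead depends entirely on the assumed numbers, so swap in values that resemble your own task before reading anything into it.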
The findings? It depends on the task, but the sweet spot is almost never at K=1 or K=2. For many subjective tasks, you need more raters per item than researchers typically budget for. The trade-off isn’t just about cost; it’s about whether your benchmark actually measures what you think it measures.
I’ve seen too many papers where authors brag about a massive dataset but skimp on annotation quality. This research gives them a framework to actually calculate the optimal allocation instead of guessing. The open-source simulator they released is probably the most practical takeaway here: you can plug in your own task characteristics and budget constraints and get a concrete recommendation.
One thing that bugged me: the paper focuses on statistical reproducibility but doesn’t deeply address the rater quality problem. Not all raters are equal. A crowd of 500 random MTurkers is not the same as 10 domain experts. The simulator assumes raters are interchangeable, which is a big assumption for many real-world tasks.
Still, this is a solid contribution. If you’re building a benchmark or running an evaluation, stop defaulting to 3 raters per item. Run the numbers. You might find you need fewer items and more raters to get results that actually hold up when someone else tries to replicate them.