VAKRA: Why Enterprise AI Agents Keep Tripping Over Simple Tasks

IBM Research just dropped VAKRA, a benchmark that makes most current AI agents look pretty incompetent. And I mean that in the most constructive way possible.

VAKRA stands for something I won’t bother spelling out because what matters is what it does: it tests AI agents on real, messy, multi-step enterprise tasks. Not the sanitized, single-turn question-answering that most benchmarks use. Actual workflows where an agent has to chain API calls, dig through documents, and keep track of what it’s doing across several steps.

Here’s the headline: models perform poorly. Embarrassingly so in many cases. And the failure modes tell us more about where the field actually is than any leaderboard ever could.

What VAKRA Actually Tests

The benchmark is built around four capabilities, each designed to stress a different aspect of agentic reasoning; this post walks through the first two. The full dataset covers 62 domains with over 8,000 locally hosted APIs backed by real databases. Tasks require 3 to 7 reasoning steps, combining structured API calls with unstructured document retrieval.

Capability 1: API Chaining with Business Intelligence APIs

This one has 2,077 test instances across 54 domains. The idea is simple: give an agent access to a set of data manipulation tools — filtering, sorting, aggregation — and see if it can chain them correctly to answer a question.

For example, the agent might need to figure out which football team has a build-up play speed of 31, a dribbling of 53, and a passing of 32. To do that, it has to call get_data first to initialize the data source, then chain three select_data_equal_to calls, and finally call get_team_name.
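To make that chain concrete, here is a minimal sketch with toy, in-memory stand-ins for the three tools. The tool names come from the example above; the signatures, the handle scheme, the column names, and the sample rows are all my assumptions, not VAKRA’s actual API.

```python
# Toy, in-memory stand-ins for the tools named in the example.
# Signatures, column names, and rows are assumptions, not VAKRA's real API.
TEAMS = [
    {"team_name": "Team A", "build_up_play_speed": 31, "dribbling": 53, "passing": 32},
    {"team_name": "Team B", "build_up_play_speed": 31, "dribbling": 48, "passing": 32},
    {"team_name": "Team C", "build_up_play_speed": 60, "dribbling": 53, "passing": 32},
]

_tables = {}  # handle -> rows held on the "server" side


def get_data(source):
    """Initialize the data source and return a handle to it."""
    _tables[source] = list(TEAMS)
    return source


def select_data_equal_to(handle, column, value):
    """Keep rows where column == value and return a handle to the result."""
    new_handle = f"{handle}/{column}={value}"
    _tables[new_handle] = [row for row in _tables[handle] if row[column] == value]
    return new_handle


def get_team_name(handle):
    """Read the team names out of whatever rows the handle points to."""
    return [row["team_name"] for row in _tables[handle]]


handle = get_data("teams")
handle = select_data_equal_to(handle, "build_up_play_speed", 31)
handle = select_data_equal_to(handle, "dribbling", 53)
handle = select_data_equal_to(handle, "passing", 32)
answer = get_team_name(handle)
print(answer)  # ['Team A']
```

In this toy data, dropping the dribbling filter leaves two teams instead of one, which is how a single skipped step turns into a wrong final answer.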

Each wrong step cascades. Miss a filter value? Wrong answer. Call tools in the wrong order? Also wrong. The benchmark tracks the full execution trace, so you can see exactly where each agent went off the rails.

The tool design here is interesting. The get_data call returns a lightweight preview rather than the full dataset, which avoids pushing large data transfers through MCP. Smart engineering, but it means agents have to reason about data they can’t fully see.
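I don’t know what VAKRA’s preview actually contains, but the general idea can be sketched like this. The make_preview helper and its fields (row count, column names, a few sample rows) are hypothetical.

```python
def make_preview(rows, sample_size=3):
    """Summarize a table so the agent sees its shape without receiving every row.

    Hypothetical helper: the fields returned here are an assumption about what
    a "lightweight preview" could reasonably include.
    """
    columns = list(rows[0].keys()) if rows else []
    return {
        "num_rows": len(rows),
        "columns": columns,
        "sample": rows[:sample_size],
    }


# A table far too large to send back on every tool call.
rows = [{"id": i, "score": i * 2} for i in range(10_000)]
preview = make_preview(rows)
print(preview["num_rows"], preview["columns"], len(preview["sample"]))
```

The agent gets enough to pick columns and plan filters, while the 10,000 rows stay on the server side.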

Capability 2: Tool Selection with Dashboard APIs

This one has 1,597 instances across 17 domains. The twist here is that each domain exposes a large set of highly specific, query-aligned endpoints — think REST APIs wrapped in MCP servers. The agent has to pick the right tool from a potentially huge set.

And I mean huge. Some domains expose up to 328 tools, and the average is 116 per domain. That’s a lot of options, and agents have to figure out which one actually does what the query needs.

Here’s a practical problem that immediately jumps out: the OpenAI API caps the tool list at 128 tools per request. So if you’re building on OpenAI, you cannot expose every tool for the largest domains in a single call. You have to architect around this, probably by grouping or routing. VAKRA exposes this constraint in a way that feels very real-world.
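One way to route around the cap is to shortlist tools before each request. The sketch below ranks tools by plain word overlap with the query and keeps the top 128. The scoring is deliberately naive, and the tool names and descriptions are made up; a real router would more likely use embeddings or domain-level grouping.

```python
import re

MAX_TOOLS = 128  # the per-request tool cap discussed above


def tokens(text):
    """Lowercase word tokens; underscores split names like get_team_name."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist(query, tools, limit=MAX_TOOLS):
    """Rank tools by word overlap with the query and keep at most `limit`."""
    query_tokens = tokens(query)

    def score(tool):
        return len(query_tokens & tokens(tool["name"] + " " + tool["description"]))

    return sorted(tools, key=score, reverse=True)[:limit]


# 328 hypothetical tools, with the relevant one sitting at the very end.
tools = [
    {"name": f"get_metric_{i}", "description": f"Return metric number {i}"}
    for i in range(327)
]
tools.append({"name": "get_revenue_by_region", "description": "Return revenue grouped by region"})

picked = shortlist("Show revenue by region", tools)
print(len(picked), picked[0]["name"])
```

Simply truncating the list to the first 128 entries would have dropped the one tool the query needs; ranking first keeps it in the request.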

Where Agents Fall Apart

Looking at the failure patterns, a few things stand out.

First, agents are terrible at maintaining state across multiple tool calls. They’ll correctly call get_data, then immediately forget what data they’re working with. Or they’ll get a filtered result and then apply the wrong subsequent filter.

Second, tool selection goes wrong in predictable ways. Given a large set of options, agents tend to grab the first tool that looks vaguely relevant rather than reasoning about which one actually fits. It’s like watching someone try to fix a car with whatever tool is on top of the toolbox.

Third, error recovery is almost nonexistent. If an API call fails or returns unexpected data, most agents just give up or hallucinate a result. They don’t retry, they don’t adapt their approach, they don’t ask for clarification.
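For contrast, here is roughly what minimal error recovery looks like on the harness side: retry a failed call with backoff, and return an explicit failure when the retries run out. ToolError, call_with_retry, and flaky_lookup are hypothetical names for illustration, not part of VAKRA or any agent framework.

```python
import time


class ToolError(Exception):
    """Hypothetical error type for a tool call that is worth retrying."""


def call_with_retry(tool, *args, attempts=3, base_delay=0.5, **kwargs):
    """Call tool(*args, **kwargs), retrying on ToolError with exponential backoff.

    Returns a dict so the caller always gets an explicit success or failure.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return {"ok": True, "result": tool(*args, **kwargs)}
        except ToolError as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(base_delay * 2 ** attempt)
    return {"ok": False, "error": str(last_error)}


calls = {"count": 0}


def flaky_lookup(team):
    """Fails twice, then succeeds, to simulate a transient API error."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ToolError("temporary failure")
    return {"team": team, "passing": 32}


outcome = call_with_retry(flaky_lookup, "Team A", base_delay=0)
print(outcome)
```

The explicit {"ok": False, ...} result matters as much as the retry: it gives the agent something concrete to react to instead of a gap it is tempted to fill with a made-up value.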

This last one is particularly damning for enterprise use. In real business environments, APIs fail, data is messy, and queries are ambiguous. An agent that can’t handle any of that is not ready for production.

Why This Benchmark Matters

Most existing benchmarks test isolated skills. Can the model answer this question? Can it write this code? Can it retrieve this fact? VAKRA tests composition — can the model do all of these things in sequence, correctly, without losing the thread?

That’s a fundamentally harder problem, and it’s the one that actually matters for enterprise deployment. No one in a real business asks a single, self-contained question. They ask things like “Show me Q3 sales by region, then compare them to last year, then flag any regions that dropped more than 10%.”
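That one sentence hides three dependent steps. Here is a small worked version with made-up figures; the years, regions, and amounts are assumptions chosen so the arithmetic is easy to check.

```python
from collections import defaultdict

# Hypothetical Q3 sales rows: (region, year, amount). All figures are made up.
Q3_SALES = [
    ("North", 2023, 100), ("North", 2024, 120),
    ("South", 2023, 100), ("South", 2024, 80),
    ("West", 2023, 100), ("West", 2024, 95),
]


def sales_by_region(rows, year):
    """Step 1: total Q3 sales per region for one year."""
    totals = defaultdict(int)
    for region, row_year, amount in rows:
        if row_year == year:
            totals[region] += amount
    return dict(totals)


def year_over_year(current, previous):
    """Step 2: fractional change per region relative to the previous year."""
    return {
        region: (current[region] - previous[region]) / previous[region]
        for region in previous
    }


def flag_drops(changes, threshold=0.10):
    """Step 3: regions whose sales fell by more than the threshold."""
    return sorted(region for region, change in changes.items() if change < -threshold)


current = sales_by_region(Q3_SALES, 2024)
previous = sales_by_region(Q3_SALES, 2023)
changes = year_over_year(current, previous)
flagged = flag_drops(changes)
print(flagged)  # ['South']
```

North is up 20%, West is down 5%, and South is down 20%, so only South gets flagged. Each step consumes the previous step’s output, which is exactly where an agent that loses track of intermediate results goes wrong.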

VAKRA also has the virtue of being executable. The benchmark doesn’t just check whether the final answer looks right — it runs the agent’s tool calls against real databases and checks the actual outputs. This catches a whole class of failures that text-only evaluation would miss.
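As a rough illustration of execution-based checking, the sketch below replays a trace of tool calls against an in-memory SQLite database and compares the final output with the expected one. The table, the teams_with_passing tool, and the trace format are assumptions for illustration, not VAKRA’s evaluation harness.

```python
import sqlite3

# A tiny stand-in database; VAKRA's real databases are far larger.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (name TEXT, passing INTEGER)")
conn.executemany("INSERT INTO teams VALUES (?, ?)", [("Team A", 32), ("Team B", 40)])


def teams_with_passing(value):
    """Hypothetical tool: names of teams whose passing score equals value."""
    cursor = conn.execute(
        "SELECT name FROM teams WHERE passing = ? ORDER BY name", (value,)
    )
    return [row[0] for row in cursor]


TOOLS = {"teams_with_passing": teams_with_passing}


def execute_trace(trace):
    """Run each (tool_name, kwargs) call for real and collect the outputs."""
    return [TOOLS[name](**kwargs) for name, kwargs in trace]


def passes(trace, expected_final):
    """A trace passes only if its last executed output matches the expected one."""
    outputs = execute_trace(trace)
    return bool(outputs) and outputs[-1] == expected_final


good_trace = [("teams_with_passing", {"value": 32})]
bad_trace = [("teams_with_passing", {"value": 40})]  # wrong filter value

print(passes(good_trace, ["Team A"]), passes(bad_trace, ["Team A"]))  # True False
```

An agent could still write "Team A" in its final text after issuing the bad trace; only running the calls reveals that the tools never produced that answer.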

I also appreciate that the dataset is grounded in real enterprise domains. These aren’t toy problems. The 62 domains include finance, healthcare, logistics, sports analytics, and more. The APIs and documents are aligned to each domain, so agents can’t just pattern-match their way through.

The Elephant in the Room

VAKRA is hard. That’s the point. But it’s also worth asking whether the benchmark is too hard in ways that don’t reflect real usage.

For example, requiring agents to call get_data before every sequence is a reasonable design choice, but it’s also a potential failure point that humans would never encounter. A human analyst would already have the data loaded. They wouldn’t need to initialize a data source every time they want to filter a column.

Similarly, the tool naming conventions in the SEL-BIRD tool set, where categorical arguments are flattened into separate functions, feel like they might be testing for quirks in the API design rather than genuine reasoning ability. A human would just click a sort button. An agent has to figure out whether to call sort_data_ascending or sort_data_descending.
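Here is what that flattening amounts to, as a sketch. The two function names are the ones mentioned above; the parameterized sort_data and all of the signatures are my assumptions about how such tools could be written.

```python
def sort_data(rows, column, order):
    """Parameterized form: the sort direction is a categorical argument."""
    if order not in ("ascending", "descending"):
        raise ValueError(f"unknown order: {order!r}")
    return sorted(rows, key=lambda row: row[column], reverse=(order == "descending"))


# Flattened form: each value of the categorical argument becomes its own tool.
def sort_data_ascending(rows, column):
    return sort_data(rows, column, "ascending")


def sort_data_descending(rows, column):
    return sort_data(rows, column, "descending")


rows = [{"team": "A", "passing": 32}, {"team": "B", "passing": 40}]
print([row["team"] for row in sort_data_descending(rows, "passing")])  # ['B', 'A']
```

The behavior is identical either way. Flattening just moves the decision from picking an argument value to picking a tool name, and it multiplies the number of tools the agent has to choose between.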

These are minor gripes. Overall, VAKRA is a serious contribution to a real problem. The field needs more benchmarks that stress composition and execution, not just memorization and pattern matching.

What Comes Next

The VAKRA dataset, leaderboard, and code are all available on GitHub. If you’re building enterprise agents, you should be testing against this. If your model can’t handle multi-step API chaining with real data, you have work to do.

And if you’re a model developer, this is a good reality check. The next wave of agent frameworks will need to handle state management, error recovery, and tool selection at a level that current models simply don’t achieve. VAKRA shows us exactly how far we have to go.