002 / 2025 · 2025
A bigger model wasn't the answer: what actually makes an analytics agent hold up
A lean prototype answered that question for cents per analysis and kept the budget away from the wrong investment.
~$0.07 per analysis · 46% baseline success rate · 2 bottlenecks identified · a validated roadmap, not guesswork
Context
A data-driven company wanted to know whether a conversational analytics agent could answer the same recurring request from its business teams: “Show me the numbers, without me needing SQL or tying up the reporting team.” But before committing to a large investment, the client wanted a different question answered first: would it even hold up on their own data? They brought me in for a deliberately small proof of concept, with a handful of domain experts as reviewers.
Challenge
The real question was never “does an LLM work?” but “does it work well enough on our data and our vocabulary to trust an answer?”
Analytical questions from the business are rarely clean. They’re ambiguous, they assume domain knowledge, and the same words often mean something different to the business than they do to the database. An agent that returns a plausible-looking but wrong number is worse than no agent at all. It erodes trust, and trust is the entire currency of self-service analytics.
On top of that came the constraints: no large up-front investment, a transparent cost model per analysis, and a setup that delivers solid results in weeks, without months of infrastructure work.
Approach
I built the agent with DSPy around the OpenAI API — deliberately without a zoo of MCP servers. In DSPy, plain Python functions are enough to give the agent its tools. That saves an entire dependency layer, keeps the PoC light, and cuts the path from idea to tested capability down to minutes instead of days.
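As a rough illustration of what "plain Python functions as tools" can look like, here is a minimal sketch. The in-memory SQLite database, the `orders` table, and the function names `list_tables`, `run_sql`, and `build_agent` are hypothetical stand-ins, not the client setup. The DSPy wiring sits in a separate function and is imported lazily, so the tools themselves run and test without DSPy or an API key.

```python
import sqlite3

# Hypothetical in-memory database standing in for the real warehouse.
_DB = sqlite3.connect(":memory:")
_DB.execute("CREATE TABLE orders (region TEXT, revenue REAL)")
_DB.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0)],
)


def list_tables() -> list[str]:
    """Return the names of all tables the agent may query."""
    rows = _DB.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def run_sql(query: str) -> list[tuple]:
    """Run a SELECT statement and return the result rows."""
    # Crude guard for the sketch: refuse anything that is not a SELECT.
    if not query.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return _DB.execute(query).fetchall()


def build_agent():
    """Wire the functions into a DSPy ReAct agent (needs dspy and an OpenAI key)."""
    import dspy  # lazy import: the tools above stay usable without DSPy installed

    # Assumed settings; adjust model name and limits to your own account.
    dspy.configure(lm=dspy.LM("openai/gpt-5-mini", temperature=1.0, max_tokens=16000))
    return dspy.ReAct("question -> answer", tools=[list_tables, run_sql])
```

DSPy reads the function name, type hints, and docstring to describe each tool to the model, which is why no separate tool schema or server is needed here.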
For the model, the choice was GPT-5-mini. It keeps the cost at roughly $0.07 per analysis, including tokens, data scanned, and agent runtime. In a PoC that runs hundreds of analyses, that’s the difference between “we experiment freely” and “every test run hurts.”
The agent had access to the Data Catalog and to SQL, so it could query the data itself. On top of that came three capabilities that set it apart from a plain text-to-SQL translator:
- Asking clarifying questions instead of guessing when a request is unclear.
- Assessing data freshness, so an answer never quietly rests on stale data.
- Self-learning — the agent curates its own domain knowledge from conversations with the domain experts.
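The freshness capability from the list above can be sketched as a small tool the agent calls before it answers. This is a minimal sketch under assumptions: the names `FreshnessReport` and `assess_freshness`, and the idea that a last-load timestamp is available per table, are illustrative and not taken from the project.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class FreshnessReport:
    table: str
    age: timedelta
    is_stale: bool


def assess_freshness(
    table: str,
    last_loaded_at: datetime,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> FreshnessReport:
    """Compare a table's last load time against the maximum age the business accepts."""
    now = now or datetime.now(timezone.utc)
    age = now - last_loaded_at
    return FreshnessReport(table=table, age=age, is_stale=age > max_age)
```

An agent that receives `is_stale=True` can then state the data age in its answer, or decline to answer, instead of silently reporting an outdated number.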
Behind that sits a two-tier memory: short-term per session, long-term as domain knowledge. Kept lean: the long-term memory is, for now, a YAML file that a dedicated sub-agent can read and update. No vector store, no extra database, as long as it isn’t yet clear whether the effort even pays off.
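A two-tier memory of this kind might look roughly like the sketch below. The class and method names are hypothetical. One deliberate swap: the sketch persists the long-term tier as JSON so it runs on the standard library alone, whereas the PoC used a YAML file.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class AgentMemory:
    """Two tiers: per-session notes that expire, domain knowledge that persists."""

    long_term: dict[str, str] = field(default_factory=dict)
    short_term: dict[str, list[str]] = field(default_factory=dict)

    def note(self, session_id: str, text: str) -> None:
        """Store a short-term note that belongs to one session only."""
        self.short_term.setdefault(session_id, []).append(text)

    def learn(self, term: str, definition: str) -> None:
        """Store domain knowledge that every future session can use."""
        self.long_term[term] = definition

    def end_session(self, session_id: str) -> None:
        """Drop the short-term tier of a finished session."""
        self.short_term.pop(session_id, None)

    def save(self, path: Path) -> None:
        """Persist only the long-term tier (JSON here, YAML in the PoC)."""
        path.write_text(
            json.dumps(self.long_term, indent=2, sort_keys=True), encoding="utf-8"
        )

    @classmethod
    def load(cls, path: Path) -> "AgentMemory":
        """Start with the persisted long-term tier and no open sessions."""
        return cls(long_term=json.loads(path.read_text(encoding="utf-8")))
```

Because only `long_term` is written to disk, session notes can never leak into the shared domain knowledge by accident.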
But the core of the project wasn’t the agent. It was its evaluation.
Evaluation
On top of promptfoo, I built a custom evaluation framework that tests the agent broadly and fast. Not just “does the right number come out?” but along three dimensions:
- Domain — does it answer real business questions correctly?
- Behavior — does it ask when it should ask? Or does it invent an answer?
- UX — how does it decide to present the data: as a number, a table, an aggregate?
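To make the behavior dimension concrete: promptfoo can call a Python function named `get_assert(output, context)` and accepts a result dictionary with `pass`, `score`, and `reason`. The sketch below is one possible check, not the framework built in the project. The test variable `expects_clarification` is hypothetical, and treating "the reply ends with a question mark" as "the agent asked a clarifying question" is a deliberately crude assumption.

```python
def get_assert(output: str, context: dict) -> dict:
    """Grade the behavior dimension: did the agent ask exactly when it should?"""
    # Hypothetical per-test variable set in the promptfoo test case.
    test_vars = context.get("vars", {})
    expects_clarification = bool(test_vars.get("expects_clarification", False))

    # Crude heuristic for the sketch: a reply ending in "?" counts as asking.
    asked = output.strip().endswith("?")

    passed = asked == expects_clarification
    if passed:
        reason = "agent behavior matched the expectation"
    elif expects_clarification:
        reason = "agent answered an ambiguous request instead of asking"
    else:
        reason = "agent asked although the request was clear"
    return {"pass": passed, "score": 1.0 if passed else 0.0, "reason": reason}
```

A production check would more likely use a structured output field or a model-graded rubric, but even a heuristic like this catches the case where an ambiguous request gets an invented answer.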
That let me test the agent reproducibly against a large range of real requests. The result was honestly sobering — and valuable precisely because of it: a 46% success rate once you factor in all the constraints. Low enough not to go to production. Precise enough to understand why.
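One way to read "once you factor in all the constraints" is that a test case only counts as a success if it passes every dimension. That reading is an assumption, and the sketch below uses made-up toy results purely to show the arithmetic, not project data.

```python
def success_rate(results: list[dict[str, bool]]) -> float:
    """Share of test cases that pass every dimension at once."""
    if not results:
        raise ValueError("no results to aggregate")
    passed = sum(all(case.values()) for case in results)
    return passed / len(results)


def rate_per_dimension(results: list[dict[str, bool]]) -> dict[str, float]:
    """Pass rate of each dimension on its own, useful for finding the weak spot."""
    if not results:
        raise ValueError("no results to aggregate")
    dimensions = results[0].keys()
    return {d: sum(case[d] for case in results) / len(results) for d in dimensions}


# Toy results for illustration only.
toy_results = [
    {"domain": True, "behavior": True, "ux": True},
    {"domain": True, "behavior": False, "ux": True},
    {"domain": False, "behavior": True, "ux": True},
    {"domain": True, "behavior": True, "ux": True},
]
print(success_rate(toy_results))        # 0.5
print(rate_per_dimension(toy_results))  # {'domain': 0.75, 'behavior': 0.75, 'ux': 1.0}
```

The combined rate is always at or below the weakest single dimension, which is why a strict overall figure looks lower than any individual score.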
Diagnosis
The root-cause analysis surfaced two bottlenecks that separate cleanly.
1. Interpreting the request. The agent lacked domain knowledge and an understanding of the topic being asked about. A model behavior makes this worse: OpenAI models tend to want to please the user. They answer a question confidently even when it’s ambiguous, or when it’s unclear whether the answer is correct. That is the exact opposite of what you want in analytics.
2. Interpreting the data. Two weaknesses showed up here. First, some datasets had critical quality issues, others minor ones. Second, and more serious, the data is currently structured more around the operational view than the business view. So the agent has to bridge the gap between the question’s language and the actual data structure on its own. And that’s hardest exactly where it matters most.
Self-learning worked well, until it hit a ceiling. Whenever domain experts contradicted an analysis without fully understanding the underlying data structure, the agent risked learning the wrong “knowledge.” This isn’t a cosmetic flaw; it’s a real risk: letting users steer the learning process unchecked opens the door to corrupted knowledge, including the deliberate kind.
Takeaway & Outlook
The most valuable output of this PoC isn’t the agent. It’s the certainty about where to invest and where not to.
The obvious reflex would be to just reach for a bigger model. I tested it: the same setup with GPT-5 lands on exactly the same success rate, at four to five times the cost. So the problem sits below the agent, in the foundation of data and knowledge it works on. The lever for good analytics agents isn’t the model, it’s the foundation.
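The comparison becomes sharper as cost per successful analysis. The sketch below uses only the figures stated above ($0.07 per analysis, a 46% success rate, four to five times the cost for GPT-5) and assumes, as a simplification, that a failed analysis has no value.

```python
def cost_per_success(cost_per_analysis: float, success_rate: float) -> float:
    """Expected spend for one usable answer, assuming failures are worthless."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_analysis / success_rate


MINI_COST = 0.07      # USD per analysis with GPT-5-mini (from the text)
SUCCESS_RATE = 0.46   # same rate for both models (from the text)

mini = cost_per_success(MINI_COST, SUCCESS_RATE)
big_low = cost_per_success(MINI_COST * 4, SUCCESS_RATE)
big_high = cost_per_success(MINI_COST * 5, SUCCESS_RATE)

print(round(mini, 2))      # 0.15
print(round(big_low, 2))   # 0.61
print(round(big_high, 2))  # 0.76
```

With an identical success rate, the larger model changes only the numerator: roughly $0.15 per usable answer becomes $0.61 to $0.76, with no gain in quality.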
That’s where the work continues. The next steps with the client:
- model the data around the business view, not the operational one,
- measure and monitor data quality instead of assuming it,
- provide curated, validated domain knowledge the agent doesn’t have to guess at itself.
And perhaps the most important, transferable lesson for anyone planning something like this: self-learning, yes — but the learning has to come from curated, validated domain knowledge, not from unchecked user feedback. Otherwise the agent learns its users’ mistakes and misunderstandings along with everything else. And, in the worst case, their intentions.
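One possible shape for that lesson is a review gate: user feedback only ever becomes a proposal, and a separate reviewer decides what enters the validated knowledge the agent reads. This is a minimal sketch with hypothetical names (`Proposal`, `CuratedKnowledge`), not the mechanism used in the PoC.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    term: str
    definition: str
    proposed_by: str


@dataclass
class CuratedKnowledge:
    """Validated knowledge the agent may use, plus a queue of unreviewed proposals."""

    validated: dict[str, str] = field(default_factory=dict)
    pending: list[Proposal] = field(default_factory=list)

    def propose(self, term: str, definition: str, proposed_by: str) -> None:
        """Record feedback from a conversation without trusting it yet."""
        self.pending.append(Proposal(term, definition, proposed_by))

    def approve(self, index: int, reviewer: str) -> None:
        """Promote a proposal; the proposer may not approve their own entry."""
        proposal = self.pending[index]
        if reviewer == proposal.proposed_by:
            raise PermissionError("a proposal needs an independent reviewer")
        self.validated[proposal.term] = proposal.definition
        del self.pending[index]

    def reject(self, index: int) -> None:
        """Discard a proposal so it never reaches the agent."""
        del self.pending[index]
```

The agent reads only `validated`, so a mistaken or deliberately misleading correction stays in `pending` until someone who understands the underlying data has checked it.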
If you’re planning an analytics or AI agent and want to know whether it’ll hold up on your data, I answer exactly that with a lean, measurable PoC — before the big budget starts flowing.