Guide · 5 min read

AI Agent Math Mistakes on Data Tasks

AI agents make consistent math mistakes on data tasks — unit conversion, rounding, aggregation order, null handling. The mistakes are not random; they are systematic, and they compound when agents pass numeric results to each other. The fix is to stop asking the LLM to do math and start making it write SQL that the warehouse executes.

This guide catalogs the six most common agent math mistakes in data work, explains why they happen, and walks through the SQL-first pattern Data Workers uses to eliminate numeric hallucination.

Why LLMs Are Bad at Math

LLMs are trained to predict plausible text, not to execute arithmetic. They pattern-match numeric operations from training data, which works for textbook examples and fails for real data at scale. An agent asked to average a million rows will confidently report a plausible but wrong number because the real operation never actually ran.

The Six Common Mistakes

  • Unit conversion — treating cents as dollars, or seconds as milliseconds
  • Aggregation order — summing before filtering, or averaging an average
  • Null handling — including or excluding nulls inconsistently
  • Rounding drift — accumulated floating-point error on large sums
  • Percent-of-total errors — comparing a filtered numerator to an unfiltered denominator
  • Time zone mistakes — bucketing UTC timestamps by local day
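The aggregation-order mistake is easy to reproduce. A minimal sketch (using a hypothetical orders table in sqlite3) shows how averaging per-group averages produces a plausible but wrong number when group sizes differ:

```python
import sqlite3

# Hypothetical orders table: two regions with very different row counts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("east", 100.0)] * 9 + [("west", 200.0)])

# Wrong: average of per-region averages ignores group sizes.
avg_of_avgs = conn.execute(
    "SELECT AVG(region_avg) FROM "
    "(SELECT AVG(amount) AS region_avg FROM orders GROUP BY region)"
).fetchone()[0]

# Right: a single average over all rows.
true_avg = conn.execute("SELECT AVG(amount) FROM orders").fetchone()[0]

print(avg_of_avgs)  # 150.0 -- plausible but wrong
print(true_avg)     # 110.0 -- 9 east rows at 100, 1 west row at 200
```

Both numbers look reasonable in isolation, which is exactly why this class of bug survives review.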

The SQL-First Fix

Data Workers agents never do math in prose. When a user asks for a number, the agent writes SQL, the warehouse executes it, and the result comes back as a validated numeric value. The agent's job is to translate intent into SQL, not to calculate. This eliminates unit errors, aggregation errors, and rounding drift in one move.
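The pattern can be sketched in a few lines. This is a minimal illustration (the intent-to-SQL lookup stands in for an LLM, and the table and column names are hypothetical): the agent only produces SQL text, and the engine produces the number.

```python
import sqlite3

# Minimal sketch of the SQL-first pattern: the "agent" returns SQL text,
# never a computed number; the warehouse (sqlite3 here) does the arithmetic.
def agent_translate(intent: str) -> str:
    # In a real system an LLM produces this; here a hypothetical lookup.
    templates = {
        "total revenue": "SELECT SUM(amount_cents) / 100.0 FROM payments",
        "order count": "SELECT COUNT(*) FROM payments",
    }
    return templates[intent]

def run(conn, intent: str) -> float:
    sql = agent_translate(intent)
    (value,) = conn.execute(sql).fetchone()
    return value  # a validated numeric value from the engine, not prose

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (amount_cents INTEGER)")
conn.executemany("INSERT INTO payments VALUES (?)", [(1250,), (399,), (10000,)])

print(run(conn, "total revenue"))  # 116.49
```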

Validation Layers

SQL alone is not enough — the LLM can still write incorrect SQL. Data Workers runs three validation layers on every numeric query. First, a schema check against the catalog to confirm the columns exist and have the expected units. Second, a range check on the result (negative revenue, future dates, impossible percentages). Third, a consistency check against historical values to flag sudden jumps.
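The three layers can be expressed as three small predicates. This sketch uses hypothetical catalog entries, thresholds, and history values; the point is the shape of the checks, not the specific limits.

```python
# Sketch of the three validation layers: schema, range, consistency.
# Catalog contents, bounds, and history below are all hypothetical.
CATALOG = {"payments": {"amount_cents": {"type": "INTEGER", "unit": "cents"}}}

def schema_check(table: str, column: str) -> bool:
    # Layer 1: the column must exist in the catalog with known units.
    return table in CATALOG and column in CATALOG[table]

def range_check(value: float, lo: float = 0.0, hi: float = 1e9) -> bool:
    # Layer 2: negative revenue or an impossible magnitude fails fast.
    return lo <= value <= hi

def consistency_check(value: float, history: list, max_jump: float = 3.0) -> bool:
    # Layer 3: flag results more than max_jump times the recent average.
    baseline = sum(history) / len(history)
    return value <= baseline * max_jump

result = 116.49
print(schema_check("payments", "amount_cents"))        # True
print(range_check(result))                             # True
print(consistency_check(result, [98.0, 110.0, 120.0])) # True
```

A query result only reaches the user after all three checks pass; a failure routes the query back to the agent with the reason attached.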

Unit Awareness in the Catalog

The catalog stores unit metadata (cents vs dollars, seconds vs milliseconds, UTC vs local) and the agent reads it before writing any query. If the user asks for "revenue this month" and the catalog says the amount column is in cents, the agent applies the conversion (dividing by 100) and labels the output in dollars. See how this integrates with autonomous data engineering.
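In code, this amounts to a small lookup before formatting any number. The catalog entry and column name here are hypothetical:

```python
# Hypothetical catalog metadata driving unit conversion and labeling.
CATALOG = {"payments.amount": {"unit": "cents"}}

def to_display(column: str, raw_value: float) -> str:
    unit = CATALOG[column]["unit"]
    if unit == "cents":
        return f"${raw_value / 100:,.2f}"  # convert to dollars and label
    return f"{raw_value} {unit}"

print(to_display("payments.amount", 1164900))  # $11,649.00
```

Because the conversion is driven by metadata rather than the model's guess, a cents column can never be silently reported as dollars.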

Cross-Agent Numeric Handoffs

When one agent produces a number and another consumes it, the number gets passed as a structured object with value, unit, source query, and confidence — not as a string. The consuming agent knows exactly what the number represents and cannot silently convert units wrong.
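One way to sketch that handoff object (the field names mirror the prose above, not a published Data Workers schema):

```python
from dataclasses import dataclass

# Structured numeric handoff: the unit travels with the value.
@dataclass(frozen=True)
class NumericResult:
    value: float
    unit: str          # e.g. "cents", "usd", "seconds"
    source_query: str  # the SQL that produced the value
    confidence: float  # 0.0 to 1.0

revenue = NumericResult(
    value=11649.0,
    unit="cents",
    source_query="SELECT SUM(amount_cents) FROM payments",
    confidence=0.97,
)

# A consuming agent reads the unit instead of guessing it.
dollars = revenue.value / 100 if revenue.unit == "cents" else revenue.value
print(dollars)  # 116.49
```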

Eval Suites for Numeric Correctness

Data Workers ships a 200-query golden set for numeric tasks, with known-correct answers validated by domain experts. Every model release runs against the suite. Regressions are caught before shipping. For the broader approach to agent eval, see AI for data infrastructure.
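The core of such a suite is a simple loop over question/expected pairs. This is a minimal sketch; the cases, tolerances, and the `run_agent` callable are all hypothetical stand-ins for the real golden set and model under test.

```python
# Minimal golden-set eval loop; cases and tolerances are hypothetical.
GOLDEN_SET = [
    {"question": "total revenue", "expected": 116.49, "tolerance": 0.01},
    {"question": "order count", "expected": 3, "tolerance": 0},
]

def evaluate(run_agent):
    """Return (question, got, expected) tuples for every regression."""
    failures = []
    for case in GOLDEN_SET:
        got = run_agent(case["question"])
        if abs(got - case["expected"]) > case["tolerance"]:
            failures.append((case["question"], got, case["expected"]))
    return failures

# A fake agent that answers from a lookup table, for demonstration.
answers = {"total revenue": 116.49, "order count": 3}
print(evaluate(lambda q: answers[q]))  # []
```

A release gate is then one line: ship only if `evaluate` returns an empty list.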

When LLM Math Is Acceptable

Small-scale arithmetic on a few values (converting one number, computing a ratio between two explicit inputs) is fine as long as the inputs are in the prompt. Large-scale math (aggregating thousands of rows, computing percentiles, joining distributions) is never fine. The rule: if it runs over more than ten values, push it to SQL.

LLMs do not do math; they predict text that looks like math. Push arithmetic to SQL, validate units in the catalog, and check results against ranges and history. To see the pattern running on a real warehouse, book a demo.

One of the sneakier failure modes is when the LLM confidently computes a ratio from numbers it fabricated. It will say something like 'revenue grew 12 percent month-over-month' based on numbers it never actually ran a query for. The percentage is plausible, the delta is plausible, and the underlying numbers do not exist. The only defense is to require every numeric claim to cite its source query, and to cross-check the query against the catalog. Data Workers' pipeline agents refuse to emit numeric claims without a citation, which eliminates this failure mode entirely.

Time zone mistakes are the most common LLM math bug we see in production. An agent asked for 'users who signed up on April 5' will often return users whose UTC timestamp falls in a different local day than the one the user meant. The fix is to always specify a time zone explicitly in the query and to surface the time zone in the result. Data Workers catalogs store the time zone convention per column and inject it into every query. Teams that skip this step see 10 to 20 percent of time-based queries return wrong-but-plausible answers.
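The bug and the fix are visible in a few lines of SQL. This sketch uses sqlite3 with a hypothetical signups table and a fixed UTC-7 offset standing in for the catalog's per-column time zone convention:

```python
import sqlite3

# Bucketing UTC timestamps by an explicit local day vs the raw UTC day.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (created_at_utc TEXT)")
conn.executemany("INSERT INTO signups VALUES (?)", [
    ("2024-04-05 01:30:00",),  # April 4 in UTC-7, April 5 in UTC
    ("2024-04-05 18:00:00",),  # April 5 in both
])

# Wrong: bucket by the raw UTC date.
utc_count = conn.execute(
    "SELECT COUNT(*) FROM signups WHERE DATE(created_at_utc) = '2024-04-05'"
).fetchone()[0]

# Right: shift into the user's zone (here UTC-7) before bucketing.
local_count = conn.execute(
    "SELECT COUNT(*) FROM signups "
    "WHERE DATE(created_at_utc, '-7 hours') = '2024-04-05'"
).fetchone()[0]

print(utc_count, local_count)  # 2 1
```

Both counts are plausible, and only the catalog's time zone metadata tells you which one answers the user's question.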

A useful guardrail is numeric output schemas. Every numeric answer the agent produces should include value, unit, source_query, confidence, and timestamp. Consumers (other agents, dashboards, humans) read the full object, not just the number. This eliminates a whole class of consumption-side mistakes because the unit travels with the value, and nobody has to remember it. Data Workers' agents emit structured numeric outputs by default, and the pattern is cheap to add to any agent stack.

The deepest lesson from triaging agent math bugs is that LLMs are pattern matchers, not calculators. They pattern-match numeric operations from their training data and produce plausible answers. For small inputs with clear examples in training data, the pattern match often works. For large inputs, novel aggregations, or unusual conventions, the pattern match breaks and the answer is wrong. Always route math to a deterministic executor (SQL, Python, Arrow) and reserve the LLM for translation and reasoning.

Stop asking the LLM to do math. Write SQL, validate units, check ranges. Everything else is hallucinated arithmetic.

See Data Workers in action

15 autonomous AI agents working across your entire data stack. MCP-native, open-source, deployed in minutes.

Book a Demo
