math-mcp: Giving LLMs a Calculator That Knows It's a Calculator
LLMs do arithmetic in their heads. Mostly they’re close. Occasionally they’re off by enough to matter — a mortgage payment that’s $90 too high, an opportunity-cost claim that’s understated by 10%, a loan payoff term that ignores the interest still accruing while you pay it down. The model doesn’t know which case it’s in. Neither do you, unless you check.
math-mcp is a small MCP server that gives the model somewhere else to send those questions. It exposes ~55 discrete tools: Go’s math standard library, gonum’s statistical aggregates, and razorpay/go-financial’s time-value-of-money functions, each wrapped as its own tool so the LLM picks them out of its tool list by name and calls them directly. One tool per function, no expression evaluator. The point is to make “I should not estimate this” the easy path.
Philosophically, what LLMs do is not rigorous computation but statistical prediction. That cuts against one of the popular assumptions about AI: that it should be able to compute like a computer and think like a human, whichever suits the task. This MCP server is intended to help close that gap for basic math.
The problem
Large language models are extraordinarily good at most things and indifferently good at floating-point arithmetic. They produce numbers fluently. The fluency is the problem: there’s no surface signal distinguishing “I computed this” from “I generated this.” A 30-year mortgage payment on $300k at 7% APR is one of those questions a model will answer confidently, and the answer will usually be within $5 of right. Usually. The handful of times the answer is wrong by $50, or applies the wrong sign convention, or assumes the wrong compounding frequency, are silent.
The narrower failure mode is multi-step financial reasoning. Given starting capital plus periodic contributions plus a real rate of return over twenty-one years, an LLM cannot reliably produce a future value to within 5%. It will offer a number; the number is approximately a vibe. For planning work, that’s a problem.
Design choices
The whole project is roughly a thousand lines of Go. The interesting parts are not the code — they’re the choices about how to expose computations to the model.
One tool per function, not a generic expression evaluator. A math_eval("2 * pi * r") style tool reintroduces the failure mode: the LLM can still write the expression wrong. Discrete tools (math_sin, math_log, financial_pmt) force structured arguments and put every available primitive in the tool list, which is where the model decides what to call.
Echo inputs back in every response. TVM functions are notorious for argument-order confusion — swap PV and FV in financial_pv and you get a plausible-looking but wrong number with no warning. The server includes the parsed inputs in the response so the model can verify it sent what it meant. This is a second line of defense against the silent-wrong-answer failure mode.
Decimal in, decimal out on the financial surface. All money values pass as decimal strings end-to-end via shopspring/decimal. A result_float field is provided for convenience, but it’s explicitly marked lossy in the schema. The canonical answer is a string the LLM has to read but cannot accidentally truncate.
Domain errors return the offending value. sqrt(-42.5) returns "math_sqrt: input must be non-negative (got -42.5)", not a bare "invalid input". The model uses the offending value to self-correct without a round trip to ask “what did I send?”
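A sketch of that error style (the wrapper below is illustrative; the error text matches the example quoted above):

```go
package main

import (
	"fmt"
	"math"
)

// sqrtTool mirrors the error convention described above: a domain error
// names the tool, the constraint, and the offending value, so the model
// can self-correct without a round trip.
func sqrtTool(x float64) (float64, error) {
	if x < 0 {
		return 0, fmt.Errorf("math_sqrt: input must be non-negative (got %v)", x)
	}
	return math.Sqrt(x), nil
}

func main() {
	if _, err := sqrtTool(-42.5); err != nil {
		fmt.Println(err) // math_sqrt: input must be non-negative (got -42.5)
	}
}
```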
Conventions explicitly stated in tool descriptions. Trig functions advertise that they take radians and point at math_deg_to_rad as the escape hatch. math_log says “natural logarithm” prominently and cross-references the base-10 and base-2 variants. math_round flags that it uses half-away-from-zero rounding, not banker’s. math_mod calls out that the result sign follows the dividend (Go/C convention; Python’s % follows the divisor). These are exactly the conversions where an LLM, having read code in many languages, will quietly apply the wrong convention.
Three surfaces
math_* — ~38 wrappers over Go’s math package: trig and inverse trig, hyperbolic and inverse hyperbolic, exp/log family, pow/hypot, gamma/erf, conversions, plus a math_constants tool returning π, e, φ, and friends at full float64 precision.
stats_* — ~10 aggregates: sum, mean, median, min, max, sample and population variance and standard deviation, percentile, and Pearson correlation, all over arrays of floats. Two deliberate convention choices here: median averages the two middle values for even-length input (the textbook definition; numpy and Excel agree), and percentile uses linear interpolation between order statistics — R’s type 7 method, matching np.percentile and Excel’s PERCENTILE.INC. The 75th percentile of [1, 2, 3, 4] is 3.25.
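The percentile convention pins down to a few lines. This is a sketch of the R type 7 method named above, not the server's gonum-backed implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// percentile uses linear interpolation between order statistics (R's
// type 7, matching np.percentile and Excel's PERCENTILE.INC):
// h = (n-1)·p/100, result = x[⌊h⌋] + (h-⌊h⌋)·(x[⌊h⌋+1] - x[⌊h⌋]).
func percentile(data []float64, p float64) float64 {
	x := append([]float64(nil), data...) // don't mutate the caller's slice
	sort.Float64s(x)
	h := (float64(len(x)) - 1) * p / 100
	lo := int(h)
	if lo == len(x)-1 {
		return x[lo]
	}
	return x[lo] + (h-float64(lo))*(x[lo+1]-x[lo])
}

func main() {
	fmt.Println(percentile([]float64{1, 2, 3, 4}, 75)) // 3.25
	fmt.Println(percentile([]float64{1, 2, 3, 4}, 50)) // 2.5 (the median)
}
```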
financial_* — 8 time-value-of-money functions: PMT, IPMT (interest portion), PPMT (principal portion), PV, FV, NPV, NPER, and RATE. Sign convention follows numpy and Excel: money received is positive, money paid out is negative. NPV follows the numpy_financial convention specifically — cash_flow[0] is period 0 and is not discounted — and the tool description spells out the porting recipe for users coming from Excel’s NPV(), which treats the first value as period 1 instead. IRR and MIRR aren’t exposed because the underlying library doesn’t ship them; they’re deliberately omitted rather than approximated.
Real-world validation
I built this and immediately had a use for it: I keep a hand-maintained planning document with mortgage math, retirement projections, and a couple of loan amortizations. The document has been carefully edited by a human (me) who knows the formulas. The natural test was to point the new tool at the existing numbers and see what disagreed.
Two errors surfaced. A mortgage-payment comparator was off by about $90 per month because the original calculation rounded by hand and slightly missed; a loan payoff term was off by three months because the napkin math (principal divided by monthly payment) ignored interest accruing against the declining balance over the payoff period. Neither error would have changed any decision in the document. Both restored internal consistency once corrected.
The deeper lesson was about the category of error. Both mistakes were the kind that look right because they’re close to right. Neither would have caught my eye on a re-read because they passed the “feels about that big” check. They were exactly the failures the tool exists to surface. A precise number on the same problem disagreeing by a small amount is more useful than a precise number from scratch, because the gap between the careful estimate and the exact answer is where the value lives.
The mortgage-payment math an LLM does in its head is usually within $5. The percentile math is usually within a tenth. The 21-year compound-growth math is genuinely a coin toss inside a 20% band. Knowing which case you’re in matters less when there’s an MCP tool registered and the model uses it for all of them.
Worth noting: I have uses for financial calculations that are trustworthy over long time horizons (amortization tables, etc.); that's why I built this. Exposing Go stdlib math functions is an easy win. Bringing in basic stats functions is potentially useful. Whether the LLM uses the exposed math functions rather than guessing, without being told to, is something I'll be watching for.
What didn’t make it
A few intentional gaps worth naming.
No expression evaluator. I keep being tempted; I keep not adding it. The whole project is structured against the model writing arithmetic of any kind. An evaluator would reopen the floor to freehand arithmetic.
No IRR or MIRR. The library doesn’t expose them. I could implement Newton-Raphson over financial_npv to add them, and may eventually, but the bar is “is there a real use case I keep hitting.” There isn’t yet.
Float64, not decimal, on the math and stats surfaces. Decimal makes sense for financial work where pennies matter; for transcendental functions and most aggregates, the precision-vs-speed trade isn’t worth the surface complexity. The financial surface gets decimal because the failure mode (off-by-a-cent compounding over 360 amortization rows) is real there and absent elsewhere.
No matrix or optimization surface. gonum includes both. Maybe later. Right now this serves the questions I find an LLM actually getting wrong; matrix inversion is not on that list.
Worth saying out loud: the tool returns numbers, not analysis. Whether to take the mortgage is a different question; whether the mortgage payment math is right is what this answers. Don’t confuse precision for prescription.
What I learned about MCP tool design
A few things crystallized while building this that weren’t obvious going in.
Tool descriptions matter. The model picks tools by name, but it parameterizes them by reading the description. A description that says “Returns sin(x)” is correct but useless; one that says “Returns sin(x). x is in RADIANS, not degrees — use math_deg_to_rad if you have degrees” pre-empts the most common failure mode at the cost of fifteen words. The same description-tightening exercise paid off for math_mod, math_round, financial_npv, and the sample-vs-population variance pairs.
Echoing inputs is cheap and high-signal. A few bytes of structured response output gives the model a way to verify what was sent. For tools where argument order is hard to keep straight (TVM functions especially), it’s the difference between “I trust this number” and “this number could be a transcription error away from the intended answer.”
Strict schemas mean the LLM never has to guess valid values. The financial tools that take a when parameter advertise the valid values ("end" and "begin") as a JSON Schema enum. Strict MCP clients reject "midnight" before the request reaches the server. The model sees the constraint in the schema and never gets the chance to invent values.
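A schema for that when parameter looks something like the following (illustrative layout, not copied from the server):

```json
{
  "type": "object",
  "properties": {
    "when": {
      "type": "string",
      "enum": ["end", "begin"],
      "description": "Whether payments are due at the end or the beginning of each period."
    }
  },
  "required": ["when"]
}
```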
Test through the same transport production uses. The Go SDK’s mcp.NewInMemoryTransports() lets tests exercise the full register-and-dispatch path — schema generation, serialization, dispatch, response shaping — instead of just unit-testing the wrapped functions. This is where the “75th percentile uses linear interpolation” contract gets locked in: a test that pins percentile([1,2,3,4], 75) = 3.25 over the wire, not in the implementation. Future me could swap implementations and the contract stays honest.
Closing
A thousand lines of Go is a pleasant size for a tool that does exactly one thing. The thing it does is push the model toward calling a calculator instead of acting like one. Most of the time the LLM-in-its-head answer would be close enough; the entire premise of math-mcp is that “close enough” is not a property you want to be guessing at after the fact. Better to make the precise answer be the easy answer, and let the LLM treat the calculator the same way I do.
- GitHub: github.com/matthewjhunter/math-mcp
- Built with: modelcontextprotocol/go-sdk, gonum, razorpay/go-financial, shopspring/decimal

