Step-by-Step Tutorial: Building a Generator-Validator Agentic Workflow with Patronus AI and Tool-Calling Agents for Reliable Math Solving



Key Takeaways

* Large Language Models (LLMs) are fundamentally unreliable at math because they are trained to predict words, not perform precise calculations. To fix this, they need specialized tools.
* The Generator-Validator (G-V) pattern uses two AI agents, one to solve a problem and another to check the work, to create a reliable, self-correcting system.
* Building trustworthy AI requires observability. Platforms that trace every step of an agent's process are crucial for debugging and ensuring accuracy.

I recently asked a top-of-the-line LLM to calculate the remaining balance on a simple loan. It confidently gave me an answer that was off by over $2,000. It wasn't just wrong; it was confidently wrong.

This is the dirty little secret of most large language models: they're brilliant illusionists but terrible accountants. They're trained to predict the next word, not to perform precise calculations, which makes them fundamentally unreliable for tasks that require 100% accuracy. So, how do we build AI systems we can actually trust with numbers?

The Reliability Problem: Why LLMs Fail at Math

Before we can fix the problem, we have to respect it. Simply throwing a math problem at a standard chatbot is a recipe for disaster. It's like asking a poet to do your taxes.

The 'Hallucination' Trap in Arithmetic

When an LLM "calculates" 25 + 37, it isn't using a calculator. It's scanning its vast training data for patterns and trying to predict a plausible-sounding answer. For complex problems, it can easily "hallucinate" an incorrect step or a wrong digit, leading to a flawed result without warning.

The Need for Verifiable Processes vs. Black Box Answers

The core issue is the lack of a verifiable process. A black box that just spits out an answer is useless if you can't see its work. This is where agentic workflows come in, moving us from a single, unreliable model to a team of specialized AI agents.

Introducing the Generator-Validator (G-V) Agentic Workflow

Breaking down complex problems is key, and the Generator-Validator (G-V) pattern is a perfect example of this philosophy. Instead of one agent doing everything, we create a two-agent system: a worker and a supervisor.

What is a Generator Agent? (The 'Solver')

The Generator is the "doer." Its only job is to take a problem and produce an initial solution. In our case, it will be the agent that reads a math problem, figures out which operations are needed, and performs the calculations.

What is a Validator Agent? (The 'Checker')

The Validator is the quality control inspector. It doesn't solve the problem from scratch. Instead, it takes the Generator's original problem and its proposed solution and scrutinizes the work to verify each step.

How They Work Together for Self-Correction

The Generator provides an answer, and the Validator checks it. If it's correct, the process is complete. If it's wrong, the Validator can provide feedback for the Generator to try again, creating a self-correcting system that is far more reliable.

Our Tech Stack: Patronus AI and Tool-Calling Agents

To build this, we need the right tools. I've been experimenting with Patronus AI, and it's uniquely suited for building and, more importantly, observing these multi-agent systems.

Prerequisites: Setting up Your Environment

First, get your environment ready. You'll need the Patronus AI library and an LLM to power your agents. I'm using LiteLLM to route to my model of choice, but you can use what you're comfortable with.

pip install patronus "smolagents[litellm]"

Why Patronus AI is Ideal for Building and Evaluating Validators

The killer feature of Patronus for this workflow is its tracing capability. By adding a simple decorator (@patronus.traced()), you can log every single step of your agentic workflow: every prompt, tool call, and output. Observability is non-negotiable for building reliable AI because you can't fix what you can't see.

This level of transparency is crucial for building systems that are fundamentally more secure. The lack of observability in many AI platforms can create hidden security risks. Patronus gives you the microscope you need to prevent that.

The Critical Role of a Calculator Tool for our Agent

To stop our LLM from hallucinating math, we're going to give it a real calculator. We'll define a set of Python functions for basic arithmetic that our agent can call. This is called "tool-calling," and it's the key to grounding our agent in reality.

# Define our basic math tools
def add(a: float, b: float) -> float:
    return a + b

def subtract(a: float, b: float) -> float:
    return a - b

def multiply(a: float, b: float) -> float:
    return a * b

def divide(a: float, b: float) -> float:
    # Python raises ZeroDivisionError on b == 0, which surfaces as a tool error
    return a / b

# Create a list of tools for our agent to use. (Some frameworks, smolagents
# included, require each function to be wrapped in a @tool decorator with a
# docstring describing its arguments.)
TOOLS = [add, subtract, multiply, divide]

Step-by-Step: Building the Generator Math Agent

Let's build our first agent, the "Solver."

Step 1: Defining the Agent's Prompt and Persona

We need to give our agent clear instructions. It's not just a calculator; it's a helpful assistant that shows its work.

GENERATOR_INSTRUCTIONS = (
    "You are a helpful mathematical assistant.\n"
    "• Think step-by-step and call the appropriate calculator tool(s).\n"
    "• Show your working and clearly state the final answer."
)

Step 2: Integrating the Calculator Tool

Now we'll create the agent with a ToolCallingAgent, passing our LLM, our TOOLS, and our instructions. (In my stack this class comes from the smolagents library; Patronus provides the tracing and evaluation layer around it.) I'm setting return_full_result=True because I want to see everything: the agent's reasoning, which tool it called, and the final output.

# ToolCallingAgent comes from the smolagents library; Patronus itself
# provides the tracing and evaluation layer, not the agent class.
from smolagents import ToolCallingAgent

# Assume router_llm is your configured LLM
def get_calculator_agent():
    return ToolCallingAgent(
        model=router_llm,
        tools=TOOLS,
        return_full_result=True, # Critical for observability!
        instructions=GENERATOR_INSTRUCTIONS
    )

Step 3: Running a Test Case and Observing its Output

Let's give it a simple problem: "What is 25 + 37?" The agent won't guess; it will reason that it should use the add tool. It then calls our Python function, gets the result 62, and presents it—verifiable and correct.
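Under the hood, a tool-calling loop is simple: the model emits a structured request naming a tool and its arguments, and the framework routes it to the matching Python function. Here's a minimal sketch of that dispatch step; the tool_call dict is a hypothetical stand-in for the model's output, since the exact request format varies by framework:

```python
# Minimal sketch of the dispatch step inside a tool-calling agent.
# The tool_call dict below is a hypothetical stand-in for the structured
# request a model emits; real frameworks use their own request schema.
def add(a: float, b: float) -> float:
    return a + b

TOOL_REGISTRY = {"add": add}

def dispatch(tool_call: dict) -> float:
    """Route a model-emitted tool call to the matching Python function."""
    fn = TOOL_REGISTRY[tool_call["name"]]
    return fn(**tool_call["arguments"])

# For "What is 25 + 37?", the model would request the add tool:
print(dispatch({"name": "add", "arguments": {"a": 25, "b": 37}}))  # 62
```

Because the arithmetic happens in plain Python rather than in the model's weights, the answer is verifiable by construction.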

Step-by-Step: Building the Validator Agent with Patronus AI

Now for the supervisor. The Validator agent's job is to double-check the Generator's work.

Step 4: Crafting the Validator's 'Scrutinizer' Prompt

The Validator needs a different set of instructions. It's not here to solve, but to verify.

VALIDATOR_INSTRUCTIONS = (
    "You are a math validator.\n"
    "• You will be given an original problem and a proposed solution from a generator agent.\n"
    "• Your task is to verify the generator's answer step-by-step.\n"
    "• State clearly whether the solution is 'Correct' or 'Incorrect' and explain why."
)

def get_validator_agent():
    return ToolCallingAgent(
        model=router_llm,
        tools=TOOLS, # It needs the tools to verify the calculations
        instructions=VALIDATOR_INSTRUCTIONS
    )

Step 5: Setting Up an Evaluation in Patronus AI to Embody the Validator

While our second agent acts as the validator, the Patronus AI platform is where we evaluate the entire system. The @patronus.traced() decorator will capture the full conversation between the Generator and Validator.

Step 6: Defining Pass/Fail Criteria Based on the Validator's Verdict

Inside the Patronus dashboard, we can analyze the traces. Our success metric is simple: does the Validator agent output the word "Correct"? The dashboard allows you to see every detail, making debugging a breeze.
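Programmatically, that pass/fail check can be as simple as scanning the Validator's final message for its verdict. Here's a minimal sketch; the verdict strings match what our VALIDATOR_INSTRUCTIONS ask for, though a production system would prefer structured output over string matching:

```python
def extract_verdict(validator_output: str) -> bool:
    """Return True if the validator declared the solution 'Correct'.

    Checks for 'incorrect' first, since 'correct' is a substring of it.
    """
    text = validator_output.lower()
    if "incorrect" in text:
        return False
    return "correct" in text

assert extract_verdict("The solution is Correct: 25 + 37 = 62.") is True
assert extract_verdict("Incorrect: the final addition is wrong.") is False
```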

Putting It All Together: Running the Full G-V Workflow

Let's connect our two agents into a single, traceable workflow.

import patronus

patronus.init()  # initialize the Patronus SDK; assumes PATRONUS_API_KEY is set in your environment

@patronus.traced("generator_validator_flow")
def demonstrate_agent_interaction(problem: str) -> str:
    print(f"Problem: {problem}")

    # Step 1: Generator agent computes the solution
    generator = get_calculator_agent()
    gen_result = generator.run(problem)
    print(f"\nGenerator Output:\n{gen_result}\n")

    # Step 2: Validator agent checks the work
    validator = get_validator_agent()
    validation_prompt = (
        f"Original problem: '{problem}'\n"
        f"Generator's proposed solution and reasoning: '{gen_result}'\n"
        f"Please validate this solution."
    )
    final_result = validator.run(validation_prompt)
    print(f"\nValidator Output:\n{final_result}\n")

    return final_result

# Let's run it!
demonstrate_agent_interaction("What is (5 * 10) + (100 / 4)?")

Inputting a Complex Math Problem

I used a multi-step problem: (5 * 10) + (100 / 4).

Observing the Generator's 'Chain of Thought' Output

The Generator agent will first call multiply(5, 10) to get 50, then divide(100, 4) to get 25. Finally, it calls add(50, 25) to get the final answer: 75. It will output this entire reasoning process.
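You can verify the expected chain by hand in plain Python; the comments map each step to the tool call the agent makes:

```python
# Mirror the Generator's tool-call sequence for (5 * 10) + (100 / 4)
step1 = 5 * 10         # multiply(5, 10)  -> 50
step2 = 100 / 4        # divide(100, 4)   -> 25.0
final = step1 + step2  # add(50, 25.0)    -> 75.0
print(final)  # 75.0
```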

Analyzing the Validator's Automated Verdict in Patronus AI

The Validator receives the problem and the Generator's full output. It will re-run the calculations using its own tools and confirm that (5 * 10) + (100 / 4) indeed equals 75. Its final output will state that the solution is correct and explain why.

The Patronus trace for this run will show a complete, auditable record of a reliable computation. It will show two distinct agent interactions, the tool calls made by each, and the final confirmation.

Conclusion: Beyond Math - Applying the G-V Pattern for Enterprise-Grade AI

This isn't just about building a fancy calculator. This Generator-Validator pattern is a foundational technique for creating reliable AI for any domain where accuracy is critical. Here are some final thoughts on building these systems.

First, don't trust, verify: never let a single LLM be the sole source of truth for critical tasks. Second, specialize your agents: break complex tasks into smaller roles like solver, checker, or planner. Finally, embrace observability: you cannot build reliable systems while flying blind.

Where do you go from here? Try expanding the toolset with more complex functions, or implement a retry loop where the Generator tries again if the Validator finds an error. This G-V pattern is the first step toward building more complex, multi-agent systems that can handle far more than just arithmetic.

Think code generation and validation, report writing and fact-checking, or legal contract drafting and clause verification. The possibilities are huge once you start building for reliability.
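The retry loop suggested above can be sketched as follows. The generate and validate functions here are hypothetical stubs standing in for the two agents' run() calls, so the control flow is runnable on its own:

```python
# Sketch of a Generator-Validator retry loop. The stub functions below
# stand in for the real agents; swap in wrappers around their run() calls.
def run_gv_loop(problem, generate, validate, max_attempts=3):
    feedback = None
    for attempt in range(1, max_attempts + 1):
        solution = generate(problem, feedback)
        verdict, feedback = validate(problem, solution)
        if verdict:
            return solution  # Validator approved; we're done
    raise RuntimeError(f"No validated solution after {max_attempts} attempts")

# Stub agents: the generator 'fixes' its answer once it receives feedback
def stub_generate(problem, feedback):
    return "62" if feedback else "63"  # first attempt is deliberately wrong

def stub_validate(problem, solution):
    ok = solution == "62"
    return ok, None if ok else "Recheck the final addition."

print(run_gv_loop("What is 25 + 37?", stub_generate, stub_validate))  # 62
```

The feedback string flows back into the Generator's next attempt, which is exactly the self-correction behavior described earlier; capping attempts with max_attempts keeps a stubborn failure from looping forever.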


