Agent = LLM + Harness: building a tool-calling agent from scratch

For the last year the concept of an agent has really taken off and is at this point in time impossible to ignore. However, in a race where we have seen many introductions of new agentic frameworks and builders, it seems to me we are losing a bit of track of what these agents actually are under the hood. What is the modern definition of an agent we are working with? What is the actual difference between an LLM and an agent? What are intended use cases for agents and is an agent the answer to all AI-related questions? Are there other AI approaches which may be better suited for your use case instead of throwing an agent at it? What does the concept of a harness entail which recently gained traction through Anthropic’s tooling?

Dirk Kemper 2026-06-15

Definition

First let’s define what an agent is and how this relates to an LLM. During my computer science studies I took the class Multi-Agent Systems, which presented a decades-old agent definition by Wooldridge which defined an agent as an autonomous system perceiving their environment and taking actions in it. This is so broad that by this definition even a room thermostat can be classified as an agent. For the last years the world has converged on a definition where agents are always based on LLMs, where the LLM dynamically decides on its own tool usage and the process to follow. The LLM decides for itself which path to follow without a developer-set control flow. Without a real academic definition at this point, I will be using the Anthropic blog post at https://www.anthropic.com/engineering/building-effective-agents which tries to resolve this.

From this post I am distilling the following key properties of an agent:

The agent runs in an endless loop until completed or stopped by a condition
The agent is able to autonomously call tools
After a tool call the agent observes the environment and decides the next step
An agent run starts from a human command or task

The main difference between an agent and a predefined workflow that calls and orchestrates LLMs, is that the agent is able to decide its own pathway through the sequence of workflow steps and may stop execution at its own discretion. This means that not only the responses of the agent become probabilistic, but the execution of the workflow as well.

This is also the main differentiator between AI being an agent and “not an agent”: the agent is having its control flow decided upon by the model and not a developer. When there is a manual workflow present, for instance when working with a RAG pattern I have documented in this blog post, you may have an AI at hand but just not an agent.

The agent loop

An agent runs in a loop which can be terminated by either completing its task or by a stopping condition. Schematically this looks as follows:

Figure 1 - The agent loop

Let’s add an example to this: let’s say we are developing an agent inspecting data quality issues in a material master data table from an SAP system. As part of the agent initialization, we will do the following:

We create a prompt indicating what the agent’s task should be. In this case we tell the agent it is a data quality inspection expert and needs to flag any data quality problems in the dataset it can acquire. Add a few examples of quality problems to this, like missing, zero or null data.
We initialize the agent with a few tools that the agent is able to use. These “tools” are function definitions of which we describe their input parameters and what they do. This is the core of what differentiates an agent from an LLM you can only chat with. In this case we define the following functions: search_materials (for retrieving the full material list), get_material (for getting the full details of a single material) and flag_material (for flagging a material for correction). Apart from documenting each function, the LLM will be able to infer the meaning of each function just from its name, like you would yourself.

We now kick off the agent process. Because we provide the tools definitions together with the system prompt at LLM startup, the LLM quickly decides it wants to call the tool for retrieving the full material list and append that to its output.

Note that a “tool call” from an LLM is not an actual technical function call to a piece of logic, but a special symbol emitted by the LLM that it likes a tool to be run. This special symbol is picked up by the runtime surrounding the agent which parses the string response from the LLM, picks up the tool call symbol and calls the actual tool at this point. We are now at the point to define the harness, because that is exactly what the runtime process around the LLM is: the machinery that performs the actual tool calls, manages the loop, and feeds each result back into the model.

Simply put: Agent = LLM + Harness.

Conversation example

The agentic concept may become clearer when looking at the low-level conversation output from the LLM itself. Remember that the only task of the LLM is still to continue a sequence of tokens. This was true in the days of the early LLMs and is still true in the era of agents. Also note that there are no “messages” or “tool” objects at all, there is still just the single flat string that the model continues, now expanded with a set of text conventions.

<|im_start|>system
You are a master-data quality assistant. Investigate records before flagging.

# Tools

You are provided with function signatures within <tools></tools> XML tags:

<tools>
{"name": "search_materials", "parameters": {"type":"object","properties":{"category":{"type":"string"},"status":{"type":"string"}}}}
{"name": "get_material", "parameters": {"type":"object","properties":{"material_id":{"type":"string"}},"required":["material_id"]}}
{"name": "flag_material", "parameters": {"type":"object","properties":{"material_id":{"type":"string"},"issue":{"type":"string"}},"required":["material_id","issue"]}}
</tools>

For each call, return json within <tool_call></tool_call> tags.<|im_end|>

<|im_start|>user
Check the chemicals category for data-quality problems and flag anything broken.<|im_end|>

<|im_start|>assistant

You will notice a few special symbols in the above message:

<|im_start|> These are “input message” markers, indicating where the system and user messages start. The system message is the overarching instruction to the LLM on how to respond and behave, where the user message is the first message the user gives. The <|im_start|>assistant marker is used to indicate to the LLM to start generating.
The <tools> opening and closing tags wrap a JSON snippet listing the available functions to call, including their input parameters.

The conversation now continues:

I'll list the chemicals first.

<tool_call>
{"name": "search_materials", "arguments": {"category": "chemicals"}}
</tool_call><|im_end|>

Here you can see the requested tool call, stopping with an <|im_end|> marker which tells the LLM to halt generating. At this point the harness will parse the JSON snippet in the <tool_call> block, calls the corresponding function and adds the function output to the message:

... everything from turn 1 ...

<|im_start|>assistant
I'll list the chemicals first.

<tool_call>
{"name": "search_materials", "arguments": {"category": "chemicals"}}
</tool_call><|im_end|>

<|im_start|>user
<tool_response>
[{"id":"M-1002","name":"Industrial solvent 5L"},
{"id":"M-1005","name":"Lithium battery pack"}]
</tool_response><|im_end|>

<|im_start|>assistant

Because the model is stateless, the entire transcript is re-sent every turn. This means model doesn’t “remember” anything and the growing string is the memory.

The loop diagram at the beginning of this post showed a “human approval” step. As you can imagine, this is also implemented by the harness and requires it to implement some form of approval prompt or dialog to the user before executing the tool call.

Lightweight implementation

I will now show an example of a lightweight implementation of this concept with a minimal amount of external libraries, with dependencies on FastMCP and the OpenAI libraries only. FastMCP is a package for spinning up a simple MCP server for hosting the functions which are callable from the agent. MCP is a standardized protocol which originated from Anthropic which allows you to build a backend using your preferred tooling to expose and serve functions to agents.

The code is available at the following Github repo: https://github.com/kemperd/agent-blog

Let’s have a look at the MCP server (tools_server.py) first:

from __future__ import annotations

from fastmcp import FastMCP

mcp = FastMCP("master-data-tools")

# --- A tiny in-memory "system of record" ------------------------------------
# In a real deployment these tools would query S/4HANA, a database, or an API.
# We keep an in-memory dict so the example runs with zero infrastructure.
# A few records have deliberate data-quality problems for the agent to find:
#   M-1002  chemical without hazard_class
#   M-1003  missing weight
#   M-1005  chemical with zero weight AND no hazard_class
_MATERIALS: dict[str, dict] = {
    "M-1001": {"id": "M-1001", "name": "Steel bolt M8", "category": "hardware",
               "unit": "EA", "weight_kg": 0.012, "hazard_class": None, "status": "active"},
    "M-1002": {"id": "M-1002", "name": "Industrial solvent 5L", "category": "chemicals",
               "unit": "EA", "weight_kg": 5.4, "hazard_class": None, "status": "active"},
    "M-1003": {"id": "M-1003", "name": "Cardboard box L", "category": "packaging",
               "unit": "EA", "weight_kg": None, "hazard_class": None, "status": "active"},
    "M-1004": {"id": "M-1004", "name": "Copper wire 100m", "category": "hardware",
               "unit": "ROL", "weight_kg": 8.2, "hazard_class": None, "status": "active"},
    "M-1005": {"id": "M-1005", "name": "Lithium battery pack", "category": "chemicals",
               "unit": "EA", "weight_kg": 0.0, "hazard_class": None, "status": "active"},
}

@mcp.tool
def search_materials(category: str | None = None, status: str | None = None) -> list[dict]:
    """List material master records, optionally filtered by category and/or status.

    Returns summary rows (id, name, category, status) only -- NOT the full record.
    Use get_material to inspect the full detail of a single record.
    """
    rows = []
    for m in _MATERIALS.values():
        if category and m["category"] != category:
            continue
        if status and m["status"] != status:
            continue
        rows.append({k: m.get(k) for k in ("id", "name", "category", "status")})
    return rows

@mcp.tool
def get_material(material_id: str) -> dict:
    """Return the full master-data record for a single material id."""
    m = _MATERIALS.get(material_id)
    if m is None:
        return {"error": f"No material found with id {material_id!r}"}
    return m

@mcp.tool
def flag_material(material_id: str, issue: str) -> dict:
    """Flag a material record for data-quality review.

    This MUTATES the record (sets status to 'review' and records the issue).
    It is a side-effecting action and should be gated behind human approval.
    """
    m = _MATERIALS.get(material_id)
    if m is None:
        return {"error": f"No material found with id {material_id!r}"}
    m["status"] = "review"
    m["quality_issue"] = issue
    return {"ok": True, "id": material_id, "new_status": "review", "issue": issue}

if __name__ == "__main__":
    mcp.run()

At the top of the file you can see a _MATERIALS dictionary being constructed, forming a simplified version of a material master data table. In reality this will come from an ERP system like S4/HANA.

Here you can see the main functions the agent is allowed to call:

search_materials: retrieves the full material list (as defined in the source in a dict, with a few built-in errors)
get_material: retrieves all data of a specific material
flag_material: flag material for data quality review. This requires human intervention before the agent is allowed to call it.

You can also see how simple it is to expose a Python source file using MCP by using FastMCP’s @mcp.tool decorator.

Now we have defined the functions for the agent, let’s have a look at the agent implementation (agent-minimal.py) itself. If you are running the code yourself, make sure to set environment variable OPENAI_API_KEY pointing to your OpenAI API key.

from __future__ import annotations

import asyncio
import json

from fastmcp import Client
from openai import OpenAI

LLM = OpenAI()
MODEL = "gpt-5-mini"

# Tools requiring explicit human approval before they run
SENSITIVE_TOOLS = {"flag_material"}

# A guard so a confused model can't loop forever.
MAX_STEPS = 10

SYSTEM_PROMPT = (
    "You are a master-data quality assistant. You can search material records, "
    "inspect them, and flag records that have data-quality problems. A record is "
    "problematic if a required attribute is missing or implausible, e.g. a "
    "missing or zero/negative weight or a chemical with no hazard class. "
    "Investigate records before flagging them, and explain each flag."
)

def mcp_tools_to_openai(tools) -> list[dict]:
    """Convert MCP tool definitions into the OpenAI function-tool schema."""
    return [
        {
            "type": "function",
            "function": {
                "name": t.name,
                "description": t.description or "",
                "parameters": t.inputSchema,
            },
        }
        for t in tools
    ]

def tool_output(result) -> object:
    """Pull a JSON-serialisable result out of an MCP CallToolResult."""
    if getattr(result, "data", None) is not None:
        return result.data  # FastMCP deserialises structured returns for us
    return "".join(getattr(block, "text", "") for block in getattr(result, "content", []))

def approve(name: str, args: dict) -> bool:
    """Human-in-the-loop gate for side-effecting tools."""
    print(f"\n[approval] The agent wants to call `{name}` with:")
    print(json.dumps(args, indent=2))
    return input("[approval] Approve? [y/N] ").strip().lower() == "y"

async def run(task: str) -> str:
    async with Client("tools_server.py") as client:
        tools = mcp_tools_to_openai(await client.list_tools())
        print(f'Registering the following tools with LLM: {tools}')
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": task},
        ]

        for _ in range(MAX_STEPS):  # <-- the agent loop
            reply = LLM.chat.completions.create(
                model=MODEL,
                messages=messages,
                tools=tools,
                tool_choice="auto",
            ).choices[0].message

            # No tool calls => the model has finished; return its answer.
            if not reply.tool_calls:
                return reply.content or ""

            # Record the assistant turn, including its tool-call requests.
            messages.append(reply.model_dump(exclude_none=True))

            # Execute each requested tool and feed the result back.
            for call in reply.tool_calls:
                name = call.function.name
                args = json.loads(call.function.arguments or "{}")

                if name in SENSITIVE_TOOLS and not approve(name, args):
                    output = {"error": "Rejected by human reviewer."}
                else:
                    print(f'Performing tool call: {name} with args: {args}')
                    output = tool_output(await client.call_tool(name, args))
                    print(f'Tool call output: {output}')

                # Every tool_call MUST get a matching tool message back.
                messages.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": json.dumps(output, default=str),
                })

        return "Stopped: reached the step limit before finishing."

if __name__ == "__main__":
    answer = asyncio.run(run(
        "Check the chemicals category for data-quality problems and flag anything broken."
    ))
    print("\n=== Agent answer ===\n" + answer)

The main loop in this program is the run()-function which corresponds with the agent loop diagram shown earlier. This loop is performing the following steps:

Process the reply from the LLM
Check if the LLM requests any tool calls. If not, the agent is finished
If tool calls are present, loop over the tools sequentially to call them
For each tool call, first check if human approval is necessary (as defined in variable SENSITIVE_TOOLS)
If the tool call is either approved by the human or didn’t require approval, call the tool and append its output to the message string

Let’s now have a look at the agent’s output:

Performing tool call: search_materials with args: {'category': 'chemicals'}
Tool call output: [{'id': 'M-1002', 'name': 'Industrial solvent 5L', 'category': 'chemicals', 'status': 'active'}, {'id': 'M-1005', 'name': 'Lithium battery pack', 'category': 'chemicals', 'status': 'active'}]
Performing tool call: get_material with args: {'material_id': 'M-1002'}
Tool call output: {'id': 'M-1002', 'name': 'Industrial solvent 5L', 'category': 'chemicals', 'unit': 'EA', 'weight_kg': 5.4, 'hazard_class': None, 'status': 'active'}
Performing tool call: get_material with args: {'material_id': 'M-1005'}
Tool call output: {'id': 'M-1005', 'name': 'Lithium battery pack', 'category': 'chemicals', 'unit': 'EA', 'weight_kg': 0.0, 'hazard_class': None, 'status': 'active'}

[approval] The agent wants to call `flag_material` with:
{
  "material_id": "M-1002",
  "issue": "Missing hazard_class for a chemical material (required). Please assign appropriate hazard class / attach SDS."
}
[approval] Approve? [y/N] y
Performing tool call: flag_material with args: {'material_id': 'M-1002', 'issue': 'Missing hazard_class for a chemical material (required). Please assign appropriate hazard class / attach SDS.'}
Tool call output: {'ok': True, 'id': 'M-1002', 'new_status': 'review', 'issue': 'Missing hazard_class for a chemical material (required). Please assign appropriate hazard class / attach SDS.'}

[approval] The agent wants to call `flag_material` with:
{
  "material_id": "M-1005",
  "issue": "Missing hazard_class and weight_kg is 0.0 (implausible). Weight must be > 0 and hazard class required for chemicals; please verify physical weight and assign hazard class / attach SDS."
}
[approval] Approve? [y/N] y
Performing tool call: flag_material with args: {'material_id': 'M-1005', 'issue': 'Missing hazard_class and weight_kg is 0.0 (implausible). Weight must be > 0 and hazard class required for chemicals; please verify physical weight and assign hazard class / attach SDS.'}
Tool call output: {'ok': True, 'id': 'M-1005', 'new_status': 'review', 'issue': 'Missing hazard_class and weight_kg is 0.0 (implausible). Weight must be > 0 and hazard class required for chemicals; please verify physical weight and assign hazard class / attach SDS.'}

=== Agent answer ===
I inspected the "chemicals" category and flagged two records for data-quality review. Summary of findings and actions:

1) M-1002 — Industrial solvent 5L
- Problems found: hazard_class is missing (null). For chemical materials the hazard class (and associated SDS) is required.
- Action taken: Flagged for review. Status set to "review" and issue recorded: "Missing hazard_class for a chemical material (required). Please assign appropriate hazard class / attach SDS."

2) M-1005 — Lithium battery pack
- Problems found: hazard_class is missing (null) and weight_kg = 0.0 (implausible — weight must be > 0).
- Action taken: Flagged for review. Status set to "review" and issue recorded: "Missing hazard_class and weight_kg is 0.0 (implausible). Weight must be > 0 and hazard class required for chemicals; please verify physical weight and assign hazard class / attach SDS."

If you want, I can:
- Provide recommended hazard classes for each item (based on name) to help the reviewer.
- Attach a suggested checklist of required fields for chemical records to prevent recurrence.
- Unflag if you prefer a different issue text.

Here you can see the steps the agent takes:

The agent calls search_materials() to get the materials from the chemical category and retrieves 2 materials
The agent does a get_material() for both of these materials to get the full details
The agent calls flag_material() for both of the materials (the user answers with “yes”)
The final output of the agent is shown after === Agent answer === and shows a summary from the agent of its own activities.

Remember that the agent does all of this without any other guidance than simply providing the prompt of its task and the tool definitions!

Agentic vs. fixed workflow approaches

As shown in the above example, the main difference between an agentic approach for a data quality checking problem and a programmed workflow in which we simply create a program to loop over all relevant materials and present these to an LLM to check for problems is the fact that we now offload the workflow to the agent. This means that we require the agent to decide for itself to query the correct material list from the source and inspect each of them. This might not be a good idea for a few reasons:

Because of the stochastic nature of the LLM it could be that the agent is incorrectly missing materials or accidentally queries materials from a wrong category. Without specifically tracing all tool calling steps of the agent, this could go undetected.
Because we have no control over the actual sequence of tool calls of the agent, this could lead to inefficient paths through the workflow, especially in scenarios where there are many more tools present and alternative ways to accomplish the same task are present. This could lead to higher token consumption than when simply building an LLM-calling workflow yourself.
We are adding latency to the tool calls because of the round trip to the LLM API, which is adding quite some overhead in situations where many tool calls are required in a loop.

So before jumping into an agent-first solution, think your problem through and first decide if you can live with a manually defined fixed workflow as well.

Wrapup

I hope this blog post takes away some of the perceived complexity and vagueness from the concept of an agent. Remember that we are still just working with an LLM under the hood which emits symbolic tool calls in the form of JSON snippets as part of its string generation process. That’s the point where we require a lightweight framework to perform the actual tool call and insert its output back into the string. That’s just it, all those frameworks you see popping up nowadays make it easier to build your final agent and integrate external tools with them, but are not materially different from what I have shown.

By taking the simple approach documented in this post instead of jumping onto a more complex agent framework first, you have some flexibility to quickly test and validate the concept of the agent you are building very fast and not get bogged down into a complex configuration framework.

agents MCP