How do function/tool calling and LLM agents work at a high level?

Tool calling extends the LLM's output space to include structured function invocations. The model emits a JSON object naming a tool and its arguments; the runtime executes the tool and feeds the result back as a new message. An agent is a loop that repeats this cycle — observe, think, act — until the task is complete or a stopping condition is met.

What is an AI agent, and how does it differ from a single LLM call?

An agent is an LLM placed in a loop where it reasons, chooses and calls tools or actions, observes the results, and repeats until a goal is met, rather than producing one response and stopping. The key differences are autonomy, tool use, memory and state, and multi-step control flow driven by the model's own decisions.

What is tool use or function calling in LLMs, and how do you design good tools for an agent?

Function calling lets an LLM output a structured request to invoke an external function with arguments, which the runtime executes and feeds back, enabling agents to act in the world. Good tool design uses clear names and descriptions, minimal well-typed parameters, narrow single-purpose scope, least privilege, and informative error messages so the model can choose and call them reliably.
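As a rough illustration of those design rules, a tool definition might look like the sketch below. The tool name `get_order_status`, its fields, and the fake order data are hypothetical. The spec layout follows the general JSON Schema style that many function-calling APIs use, not any specific provider's format.

```python
import json

# Hypothetical tool spec: one narrow, read-only purpose and two typed parameters.
TOOL_SPEC = {
    "name": "get_order_status",
    "description": "Look up the shipping status of one order by its ID. Read-only.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Order ID, e.g. 'A-1042'."},
            "include_history": {"type": "boolean", "description": "Also return past statuses."},
        },
        "required": ["order_id"],
    },
}

_FAKE_ORDERS = {"A-1042": ["created", "shipped"]}  # stand-in for a real data store

def get_order_status(order_id: str, include_history: bool = False) -> str:
    """Return a JSON string: the status, or an error the model can act on."""
    history = _FAKE_ORDERS.get(order_id)
    if history is None:
        # Informative error: says what went wrong and what a valid ID looks like.
        return json.dumps({"error": f"no order with id '{order_id}'",
                           "hint": "IDs look like 'A-1042'"})
    result = {"order_id": order_id, "status": history[-1]}
    if include_history:
        result["history"] = history
    return json.dumps(result)
```

The error branch matters as much as the success branch: a model that receives a hint about the expected ID format can usually correct its next call.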

What prompt engineering techniques should every LLM practitioner know?

The core toolkit is: system prompts (role and constraints), few-shot examples (format and tone anchoring), chain-of-thought (step-by-step reasoning), and output constraints (JSON schema, stop sequences). Used together, they narrow the gap between a capable base model and a production-ready feature, though how much depends on the model and the task.
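A minimal sketch of how three of these pieces (system prompt, a few-shot pair, and an output constraint) might be combined into one chat-style request. The triage task, the `build_messages` helper, and the role/content message layout are assumptions for illustration; real APIs differ in the exact field names.

```python
import json

def build_messages(ticket_text: str) -> list[dict]:
    """Assemble system prompt, one few-shot example, and the real input."""
    system = (
        "You are a support triage assistant. "                      # role
        "Classify the ticket; never reply to the customer. "         # constraint
        'Answer with JSON only: {"category": "billing|bug|other", "urgent": true|false}'
    )
    # Few-shot pair: anchors the exact output format the system prompt asks for.
    example_in = "I was charged twice this month."
    example_out = json.dumps({"category": "billing", "urgent": True})
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": example_in},
        {"role": "assistant", "content": example_out},
        {"role": "user", "content": ticket_text},
    ]

messages = build_messages("The export button crashes the app.")
```

Chain-of-thought would be one more instruction in the system prompt, at the cost of having to separate the reasoning from the final JSON afterwards.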

The Agent Harness — Agentic AI

Send a message to a language model and it responds. Send the same message again and it answers as if for the first time: it has no memory of the first call, no knowledge of what it did before, and no ability to act on anything. (With sampling enabled, the two replies may not even match word for word.) Between calls, the model behaves like a stateless function: text in, text out. Nothing about it is agentic.

The thing that makes it an agent is the code around it. That code is the harness (from the engineering term for the scaffolding that drives and monitors a system) — the loop, the tool dispatch, the context manager, the stop conditions, the permissions layer. A well-built harness is invisible when things go right. It is the entire problem when things go wrong.

This lesson builds one from scratch.

The model is stateless. The harness is the loop.

A language model has no memory across calls. Every invocation starts from whatever you put in the context window — nothing more. If you want the model to remember what happened two steps ago, the harness has to put that history into the prompt. If you want the model to call a tool and then reason about the result, the harness has to run the tool and append the result to the context. This is why the harness exists: without it, every API call is independent, and you get a chatbot, not an agent.
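The difference between independent calls and a harness-carried history can be sketched with a toy stand-in for the model. `fake_model` and its name-recall behaviour are invented for this illustration; the only point is that the function sees nothing except the messages passed in.

```python
PREFIX = "My name is "

def fake_model(messages: list[dict]) -> str:
    """Stand-in for a model call: it can use only what is in `messages`."""
    names = [m["content"][len(PREFIX):].rstrip(".")
             for m in messages
             if m["role"] == "user" and m["content"].startswith(PREFIX)]
    return f"Your name is {names[-1]}." if names else "I don't know your name."

# Two independent calls: the second cannot see the first.
fake_model([{"role": "user", "content": "My name is Ada."}])
forgotten = fake_model([{"role": "user", "content": "What is my name?"}])

# The harness carries the history forward, so the second call can "remember".
history = [{"role": "user", "content": "My name is Ada."}]
history.append({"role": "assistant", "content": fake_model(history)})
history.append({"role": "user", "content": "What is my name?"})
remembered = fake_model(history)

print(forgotten)   # I don't know your name.
print(remembered)  # Your name is Ada.
```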

The core agent loop is:

while True:
    response = call_model(context)
    if response.is_tool_call:
        result = dispatch_tool(response.tool, response.args)
        context += [response, result]   # the model sees both on the next call
    else:
        return response.text   # final answer

That’s not pseudocode for the hard part — that’s the entire structural idea. Every framework (LangGraph, MAF, ADK) is just this loop with opinions about state management, persistence, and error handling.

The agent loop: call the model, and if it asks for a tool, act then observe and loop; otherwise return the answer. Every framework is this loop with opinions.

The six components

A minimal harness needs six things. Most of the engineering budget in a production agent goes into these.

1. The agent loop (think → act → observe)

The loop drives the model forward until it reaches a terminal state. Each iteration: call the model, inspect the output, if it’s a tool call run the tool and feed back the result, otherwise surface the answer.

The three steps you’ll see named everywhere — think, act, observe — map directly to a single loop iteration: the model thinks (generates a response), the harness acts (runs the tool), the model observes (the result is appended to context for the next call).

2. Tool dispatch and schema validation

Tools are functions registered with the harness. The model requests a tool by name and provides arguments as JSON. The harness is responsible for:

Schema validation — check that the arguments match the tool’s declared schema before calling anything. If they don’t, return a well-formed error back to the model so it can retry, not a Python traceback.
Dispatch — route the validated call to the right function.
Result serialisation — return results as strings (or structured text) the model can reason about.

Without schema validation you get cascading failures: a malformed argument propagates into the tool, the tool raises an exception, the harness panics, the run dies. Models do produce malformed tool calls, especially in edge cases.
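One step comes before schema validation when the model's output arrives as raw text: parsing it. The sketch below assumes a hypothetical wire format, a JSON object with `tool` and `args` fields, and turns every parse failure into an error string the harness can hand back to the model.

```python
import json

def parse_tool_call(raw: str):
    """Parse raw model output. Returns (tool, args, error); error is None on success."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, None, f"ERROR: tool call is not valid JSON ({exc.msg} at char {exc.pos})"
    if not isinstance(call, dict) or not isinstance(call.get("tool"), str):
        return None, None, "ERROR: tool call must be an object with a string 'tool' field"
    args = call.get("args", {})
    if not isinstance(args, dict):
        return None, None, "ERROR: 'args' must be a JSON object"
    return call["tool"], args, None

ok = parse_tool_call('{"tool": "add", "args": {"a": 1, "b": 2}}')
truncated = parse_tool_call('{"tool": "add", "args": {"a": 1,')
```

A truncated call like the second one is a typical edge case, for example when generation stops at a token limit. It comes back as an `ERROR: ...` observation instead of an exception in the harness.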

3. Context management (windowing and compaction)

Context windows are finite. A naive harness that appends every observation forever eventually overflows. The harness must manage what goes into the context window, using one or more of:

Sliding window — keep only the last N messages.
Compaction / summarization — periodically ask the model to summarise earlier history and replace it with the summary.
Retrieval — store older observations externally and retrieve relevant chunks by embedding similarity when needed.

The tradeoff is always cost vs. fidelity: a smaller context costs less and runs faster, but the model may lose an earlier observation it needs.
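Compaction can be sketched as follows. The `summarize` function is a hypothetical placeholder: a real harness would ask the model for the summary, while this version just truncates each message so the example runs offline. The `keep_recent` parameter and the choice to pin the first message (the task) are assumptions.

```python
def summarize(messages: list[dict]) -> str:
    """Placeholder for a model-written summary (here: crude truncation)."""
    return " | ".join(m["content"][:40] for m in messages)

def compact(context: list[dict], keep_recent: int = 4) -> list[dict]:
    """Keep the task and the newest messages; fold the middle into one summary."""
    if len(context) <= keep_recent + 1:
        return context                      # nothing worth compacting yet
    cut = len(context) - keep_recent
    task, middle, recent = context[0], context[1:cut], context[cut:]
    summary = {"role": "system",
               "content": "Summary of earlier steps: " + summarize(middle)}
    return [task, summary] + recent

ctx = [{"role": "user", "content": "task"}]
ctx += [{"role": "tool", "content": f"obs {i}"} for i in range(9)]
compacted = compact(ctx)
print(len(ctx), "->", len(compacted))   # 10 -> 6
```

Pinning the first message is a design choice: a plain sliding window would eventually drop the task itself.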

4. Permissions and sandboxing

Tools can do real things — write files, call APIs, send emails. The harness is the last line of defense before those side effects happen. A minimal permissions layer:

Maintains an allowlist of which tool/argument combinations are permitted for this run or this user.
Optionally requires human confirmation before irreversible actions (file deletion, sending a message).
Prevents tool calls that reach outside the declared scope of the task.

Skipping this is fine for demos. It is not fine for anything that touches real infrastructure.
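A minimal sketch of those three checks, assuming a hypothetical set of file tools confined to a `/workspace` directory. The tool names, the `confirm` callback, and the policy constants are all invented for illustration. A production layer would also need OS-level sandboxing, which this does not provide.

```python
from pathlib import PurePosixPath

# Hypothetical policy for one run.
ALLOWED_TOOLS = {"read_file", "write_file", "delete_file"}
NEEDS_CONFIRMATION = {"delete_file"}          # irreversible actions
WORKSPACE = PurePosixPath("/workspace")       # declared scope of the task

def check_permission(tool: str, args: dict,
                     confirm=lambda tool, args: False) -> tuple[bool, str]:
    """Return (allowed, reason). `confirm` asks a human; the default says no."""
    if tool not in ALLOWED_TOOLS:
        return False, f"tool '{tool}' is not on the allowlist"
    path = PurePosixPath(args.get("path", ""))
    # Scope check: reject relative paths, '..' segments, and paths outside the workspace.
    inside = WORKSPACE in (path, *path.parents)
    if not path.is_absolute() or ".." in path.parts or not inside:
        return False, f"path '{path}' is outside {WORKSPACE}"
    if tool in NEEDS_CONFIRMATION and not confirm(tool, args):
        return False, f"'{tool}' needs human confirmation and it was not given"
    return True, "ok"

print(check_permission("read_file", {"path": "/workspace/notes.txt"}))  # (True, 'ok')
print(check_permission("read_file", {"path": "/workspace/../etc/passwd"})[0])  # False
```

Denials return a reason string so the harness can feed it back to the model as an observation, or log it, instead of silently dropping the call.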

5. Stop conditions and runaway guards

An agent loop without a stop condition can run forever — burning tokens, making tool calls, and producing nothing useful. Stop conditions include:

Max steps — a hard cap on loop iterations.
Max tokens — a budget cap across the whole run.
Repetition detection — if the model calls the same tool with the same arguments twice in a row, it is probably stuck.
Timeout — wall-clock time limit for the whole run.

A well-designed harness checks the cheapest guards first: comparing a running token count is a single comparison, repetition detection is a dictionary lookup, and a timeout is a clock read. All of them should run before the next model call, which is the expensive step.
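The four guards can be bundled into one small object, sketched below. The `RunGuards` class, its method names, and the default limits are assumptions for illustration. The injectable `clock` exists only so the timeout can be tested without waiting.

```python
import time

class RunGuards:
    """Stop conditions for one agent run. Methods return a stop reason or None."""

    def __init__(self, max_steps=8, max_tokens=2000, timeout_s=30.0,
                 clock=time.monotonic):
        self.max_steps, self.max_tokens, self.timeout_s = max_steps, max_tokens, timeout_s
        self.clock = clock
        self.started = clock()
        self.steps = 0
        self.tokens = 0
        self.last_call = None

    def record(self, tool, args, tokens_used):
        """Call after each tool dispatch; flags a back-to-back identical call."""
        call = (tool, tuple(sorted(args.items())))
        repeated = call == self.last_call
        self.last_call = call
        self.steps += 1
        self.tokens += tokens_used
        return "repeated call" if repeated else None

    def check(self):
        """Call before each model call; all three checks are cheap."""
        if self.steps >= self.max_steps:
            return "max steps"
        if self.tokens >= self.max_tokens:
            return "token budget"
        if self.clock() - self.started >= self.timeout_s:
            return "timeout"
        return None

fake_now = [0.0]
guards = RunGuards(max_steps=3, max_tokens=100, timeout_s=5.0, clock=lambda: fake_now[0])
first = guards.record("search", {"q": "x"}, 10)    # None
second = guards.record("search", {"q": "x"}, 10)   # "repeated call"
fake_now[0] = 6.0
stop_reason = guards.check()                       # "timeout"
```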

6. Verification

Not every loop should terminate at the model’s first “done”. For tasks where correctness matters — code execution, structured extraction, multi-step calculation — the harness can run a verification step after the model declares it is finished: execute the code and check it doesn’t throw, parse the output against a schema, run a secondary evaluator call.

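This is the evaluator-optimizer pattern applied inside the harness. A secondary evaluator costs one extra model call per run, while executing code or parsing against a schema costs no model call at all. Either way, it catches a meaningful fraction of hallucinated or malformed final answers.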
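A schema-style verification step with a bounded retry might look like the sketch below. The output contract in `REQUIRED_FIELDS`, the function names, and the shape of the feedback message are all hypothetical. `call_model` is any callable that takes the context and returns the model's final text.

```python
import json

REQUIRED_FIELDS = {"answer": str, "sources": list}   # hypothetical output contract

def verify_final(text: str) -> tuple[bool, str]:
    """Check a 'done' message against the contract. Returns (ok, feedback)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False, "Final answer must be valid JSON."
    if not isinstance(data, dict):
        return False, "Final answer must be a JSON object."
    for field, typ in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), typ):
            return False, f"Field '{field}' is missing or not a {typ.__name__}."
    return True, "ok"

def finish_with_verification(call_model, context: list[dict], max_retries: int = 2):
    """Ask for a final answer; on a failed check, append feedback and ask again."""
    for _ in range(max_retries + 1):
        text = call_model(context)
        ok, feedback = verify_final(text)
        if ok:
            return json.loads(text)
        # Note: this mutates the caller's context so the model sees the feedback.
        context.append({"role": "user", "content": f"Verification failed: {feedback}"})
    return None   # retries exhausted: surface a failure, not a bad answer

replies = iter(["not json", '{"answer": "42", "sources": []}'])
ctx = []
final = finish_with_verification(lambda c: next(replies), ctx)
```

The retry cap matters: without it, verification becomes a second place where the loop can run away.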

A minimal harness you can run

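The code below implements a minimal version of all six components in plain Python, with no frameworks and no network. The model is faked by a deterministic function so the loop is fully reproducible. Two simplifications: the permissions check is a stub that allows everything, and there is no wall-clock timeout. Walk through each section and see how the pieces connect.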

# Minimal agent harness — all six components, no external deps.
# The model is faked with a deterministic function (no network, no frameworks).

# ---------------------------------------------------------------------------
# 1. Tool registry (dispatch table + schema declaration)
# ---------------------------------------------------------------------------

def tool_add(a: int, b: int) -> str:
    return str(a + b)

def tool_search(query: str) -> str:
    # Fake search result
    return f"Top result for '{query}': 42 relevant documents found."

TOOLS = {
    "add":    {"fn": tool_add,    "params": {"a": int, "b": int}},
    "search": {"fn": tool_search, "params": {"query": str}},
}

# ---------------------------------------------------------------------------
# 2. Schema validation
# ---------------------------------------------------------------------------

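def validate_and_call(name, args):
    """Validate args against declared schema, then dispatch.

    Always returns a string: the tool result, or an "ERROR: ..." message
    the model can read and retry on.
    """
    if name not in TOOLS:
        return f"ERROR: unknown tool '{name}'"
    schema = TOOLS[name]["params"]
    unexpected = sorted(set(args) - set(schema))
    if unexpected:
        return f"ERROR: unexpected args {unexpected} for tool '{name}'"
    coerced = {}
    for k, typ in schema.items():
        if k not in args:
            return f"ERROR: missing required arg '{k}' for tool '{name}'"
        try:
            coerced[k] = typ(args[k])   # lenient coercion, e.g. "17" -> 17
        except (ValueError, TypeError):
            return f"ERROR: arg '{k}' must be {typ.__name__}"
    try:
        return TOOLS[name]["fn"](**coerced)
    except Exception as exc:   # a tool failure becomes an observation, not a crash
        return f"ERROR: tool '{name}' failed: {exc}"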

# ---------------------------------------------------------------------------
# 3. Fake model (deterministic — simulates think -> decide -> tool call / done)
# ---------------------------------------------------------------------------

_CALL_SEQUENCE = [
    {"type": "tool_call", "tool": "search", "args": {"query": "best sorting algo"}},
    {"type": "tool_call", "tool": "add",    "args": {"a": 17, "b": 25}},
    {"type": "done",      "text": "Done. Search found results; 17+25=42."},
]

def call_model(context, step):
    """Fake model: returns a tool call or a final answer."""
    if step < len(_CALL_SEQUENCE):
        return _CALL_SEQUENCE[step]
    return {"type": "done", "text": "Ran out of scripted steps."}

# ---------------------------------------------------------------------------
# 4. Context manager (simple window — keep last N messages)
# ---------------------------------------------------------------------------

WINDOW = 10

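def trim_context(ctx):
    # Always keep the first message (the task); window only what follows,
    # otherwise a long run would slide the task itself out of the context.
    if len(ctx) <= WINDOW:
        return ctx
    return ctx[:1] + ctx[-(WINDOW - 1):]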

# ---------------------------------------------------------------------------
# 5. Harness — the loop with all six components wired in
# ---------------------------------------------------------------------------

def run_agent(task, max_steps=8, max_tokens_approx=2000):
    context   = [{"role": "user", "content": task}]
    seen      = {}           # for repetition detection
    tokens    = len(task)    # character count, a crude stand-in for a real token count
    step      = 0

    print(f"Task: {task}\n{'='*50}")

    while step < max_steps:
        # --- runaway guard: token budget ---
        if tokens > max_tokens_approx:
            print(f"[STOP] token budget exceeded at step {step}")
            break

        response = call_model(context, step)

        if response["type"] == "done":
            # --- 6. verification (trivial check: non-empty answer) ---
            answer = response["text"].strip()
            if not answer:
                print("[VERIFY] empty final answer — treating as failure")
                break
            print(f"[FINAL] {answer}")
            return answer

        elif response["type"] == "tool_call":
            tool_name = response["tool"]
            tool_args = response["args"]

            # --- 4. permissions check ---
            # (here we allow all; in production check allowlist)
            permitted = True
            if not permitted:
                print(f"[DENY] tool '{tool_name}' not permitted")
                break

            # --- repetition detection (flags any repeat in this run, not only back-to-back) ---
            call_key = (tool_name, str(sorted(tool_args.items())))
            if call_key in seen:
                print(f"[STOP] repeated call to {tool_name}{tool_args} — agent is stuck")
                break
            seen[call_key] = True

            # --- 2. validate + dispatch ---
            result = validate_and_call(tool_name, tool_args)
            print(f"[TOOL] {tool_name}({tool_args}) -> {result}")

            # --- 3. context management (append + window) ---
            context.append({"role": "assistant", "content": str(response)})
            context.append({"role": "tool",      "content": result})
            context = trim_context(context)
            tokens += len(result)

        step += 1

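    else:
        # while/else: this branch runs only when the loop ends without a break,
        # i.e. the step cap was really hit; the guards above break past it.
        print("[STOP] max steps reached without a final answer")
    return None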

run_agent("Find the best sorting algorithm and add 17 + 25.")

Task: Find the best sorting algorithm and add 17 + 25.
==================================================
[TOOL] search({'query': 'best sorting algo'}) -> Top result for 'best sorting algo': 42 relevant documents found.
[TOOL] add({'a': 17, 'b': 25}) -> 42
[FINAL] Done. Search found results; 17+25=42.

Trace the output — each [TOOL] line is one dispatch cycle, and the [FINAL] line is the verification-cleared answer. Two experiments make the guards concrete: set max_steps=1 and the loop stops before the add step ever runs; remove the repetition-detection block and a model that repeats a tool call would loop instead of being caught.

What breaks without each component

Missing component	Failure mode
Context management	Context overflow; model forgets earlier observations
Schema validation	Malformed args propagate into tools; tracebacks in model context
Repetition detection	Agent loops on the same tool call indefinitely
Max-steps cap	Runaway spend — a stuck agent at 1000 steps costs real money
Permissions check	Model (or adversarial input) calls destructive tools unchecked
Verification	Hallucinated final answers accepted as correct

None of these are hypothetical. They are the six failure categories you will see in the first production incident for any agent system that skips them.

In one breath

A language model is a pure function — text in, text out, stateless; the harness is all the code around it that turns it into an agent.
The core is the agent loop: call the model → if it’s a tool call, dispatch and append the result → else return the answer — think, act, observe, repeated.
A minimal harness needs six components: the loop, tool dispatch + schema validation, context management (window/compaction/retrieval), permissions/sandboxing, stop conditions (max steps/tokens, repetition, timeout), and verification.
The harness is the security boundary — the only component running with real credentials; the model can only request actions, so permission and runaway guards must live here.
Every framework (LangGraph, MAF, ADK) is this same loop with opinions on state, persistence, and hosting — and each missing component maps to a specific first-incident failure mode.

Quick check

Q1. A language model is described as 'stateless'. What does this mean for the harness?

Q2. Which harness component is responsible for catching the case where a model requests the same tool call twice in a row with identical arguments?

Q3. Why does schema validation belong in the harness rather than inside each tool function?

The next lessons put this harness to work inside LangGraph — where the loop becomes an explicit state machine and the context becomes a typed state dict you can inspect, pause, and resume.
