Your LLM Agent's Tool Calls Are Uncompensated Transactions

Akash Moorching, April 2026

Every agent framework sells you on the same promise: checkpoints, retries, resilience. LangGraph will resume from state. Temporal will retry your activity. But when your expense-approval agent fails at step 6, neither one refunds the card charge from step 4.

That charge is a side effect. It escaped. And your framework has no opinion about it, because frameworks manage execution flow, not effect semantics.

You already know this, and you've already solved it. Or, more accurately, you've patched around it.

The code you're writing today

If you've built agent tooling at a company that moves money, somewhere in your codebase there's something like this:

typescript

const completedSteps: Array<{ tool: string; result: any }> = [];

try {
  const balance = await checkBalance(accountId);
  completedSteps.push({ tool: "check_balance", result: balance });

  const charge = await stripe.charges.create({ amount, source });
  completedSteps.push({ tool: "charge_card", result: charge });

  const email = await sendConfirmation(userEmail, charge.id);
  completedSteps.push({ tool: "send_email", result: email });

  const ledgerEntry = await updateLedger(charge.id, amount);
  completedSteps.push({ tool: "update_ledger", result: ledgerEntry });

} catch (err) {
  // TODO: handle partial failure
  for (const step of completedSteps.reverse()) {
    try {
      if (step.tool === "charge_card") {
        await stripe.refunds.create({ charge: step.result.id });
      }
      if (step.tool === "update_ledger") {
        await deleteLedgerEntry(step.result.id);
      }
      // send_email... uhh, can't unsend. Log it? Slack alert?
      // check_balance... doesn't matter, skip
    } catch (cleanupErr) {
      // if the cleanup fails... log and hope?
      console.error("Cleanup failed:", cleanupErr);
      await postToSlack(`⚠️ Manual intervention needed: ${step.tool}`);
    }
  }
}

This works for one workflow. Then you build a second agent with different tools and different cleanup logic, and you copy-paste the pattern. By the third, someone suggests extracting a helper. The helper grows. It handles some tools but not others. Someone adds a flag for “tools that can't be undone” and someone else adds a different flag for “tools that need approval before running.” There's no shared vocabulary for what these flags mean. Six months later, a new hire asks which tools are safe to retry and nobody can answer without reading the implementation of each one.

The problem isn't any single cleanup block; it's that ad hoc compensation doesn't compose. Every workflow re-derives the same questions from scratch (can this be undone? how? what if the undo fails?) and answers them inline, inconsistently, without an audit trail.

What changes with Unwind

Unwind is a middleware layer that sits between your agent and your tools. It does one thing: it forces you to classify each tool's side effects at definition time, and then enforces the compensation contract that classification implies.

Here's the same workflow:

typescript

const checkBalance = unwind.tool({
  name: "check_balance",
  effectClass: "idempotent",
  execute: async ({ account_id }) => getBalance(account_id),
});

const chargeCard = unwind.tool({
  name: "charge_card",
  effectClass: "reversible",
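  execute: async (args) =>
    stripe.charges.create(
      { amount: args.amount, source: args.source },
      // Stripe's Node SDK takes the idempotency key as a request option
      // (second argument), not as a field on the charge itself
      { idempotencyKey: args.__idempotencyKey },
    ),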
  compensate: async (_args, result) =>
    stripe.refunds.create({ charge: result.id }),
});

const sendEmail = unwind.tool({
  name: "send_email",
  effectClass: "append-only",
  execute: async (args) => email.send(args),
});

const deleteAccount = unwind.tool({
  name: "delete_account",
  effectClass: "destructive",
  execute: async (args) => db.accounts.delete(args.account_id),
  approvalGate: async (args) => requestHumanApproval(args),
});

Four effect classes. Four sets of rules.

Idempotent — no meaningful side effect. Safe to retry infinitely. Skipped during compensation because there's nothing to undo. checkBalance, getUser, lookupPrice.

Reversible — the effect can be mechanically undone, and the type system won't let you register a reversible tool without providing the compensate function. This is a compile-time guarantee, not a runtime hope. If TypeScript compiles, the rollback logic exists. chargeCard, createHold, transferFunds.

Append-only — it happened and it can't be recalled. An email was sent. A webhook was fired. A row was inserted into an immutable audit log. Unwind doesn't pretend it can undo these. Instead, it records exactly what escaped — the recipient, the payload, the timestamp — so an operator knows the full blast radius without grepping CloudWatch. sendEmail, postWebhook, writeAuditLog.

Destructive — irreversible and high-stakes enough that the agent shouldn't fire it without a human in the loop. Unwind gates execution behind an approval callback that you implement. The tool won't run until the gate clears. deleteAccount, revokeAccess, terminateSubscription.

These are compile-time contracts. A new engineer reads the tool definition and immediately knows: this tool is reversible, here's how it gets reversed, and the framework enforces that contract.
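To make the compile-time claim concrete, here is a minimal sketch of how a discriminated union on the effect class can force reversible tools to carry a compensate function. The names defineTool, ToolDef, HoldArgs, and Hold are hypothetical, written for this illustration; they are not Unwind's actual type definitions, which may be structured differently.

```typescript
// Hypothetical types for illustration only; not Unwind's real definitions.
type Execute<A, R> = (args: A) => Promise<R>;

type ToolDef<A, R> =
  | { name: string; effectClass: "idempotent"; execute: Execute<A, R> }
  | { name: string; effectClass: "append-only"; execute: Execute<A, R> }
  | {
      name: string;
      effectClass: "reversible";
      execute: Execute<A, R>;
      compensate: (args: A, result: R) => Promise<unknown>; // required for this class
    }
  | {
      name: string;
      effectClass: "destructive";
      execute: Execute<A, R>;
      approvalGate: (args: A) => Promise<boolean>; // required for this class
    };

// Identity helper: its only job is to make the compiler check the shape.
function defineTool<A, R>(def: ToolDef<A, R>): ToolDef<A, R> {
  return def;
}

type HoldArgs = { amount: number };
type Hold = { holdId: string };

// Compiles: a reversible tool that says how to undo itself.
const createHold = defineTool<HoldArgs, Hold>({
  name: "create_hold",
  effectClass: "reversible",
  execute: async (args) => ({ holdId: `hold_${args.amount}` }),
  compensate: async (_args, result) => ({ released: result.holdId }),
});

// Rejected by the compiler: "reversible" without a compensate function.
// @ts-expect-error Property 'compensate' is missing
const holdWithoutUndo = defineTool<HoldArgs, Hold>({
  name: "create_hold_without_undo",
  effectClass: "reversible",
  execute: async (args) => ({ holdId: `hold_${args.amount}` }),
});
```

The discriminant does the work here: once effectClass is the literal "reversible", only one member of the union can match, and that member lists compensate as a required property.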

Compensation in practice

When the workflow fails, compensation is one call:

typescript

const summary = await unwind.compensate(runId);

Unwind walks completed steps in reverse and applies the strategy each effect class dictates. Here's what the operator sees:

Compensation summary for run_7f3a2b:
  ✓ Compensated:      charge_card → refunded (re_abc123)
  ⚠ Uncompensatable:  send_email to user@acme.com
                       subject: "Your card was charged"
                       sent at: 2025-04-15T14:23:07Z — irreversible
  ○ Skipped:          check_balance (idempotent, no side effect)
  ✗ Failed:           (none)

Every question you'd ask during an incident — what got rolled back? what's still out there? do I need to intervene? — is answered in four lines. The charge was refunded. The email can't be unsent, but you know exactly which email, to whom, at what time. The balance check was a no-op. Nothing failed to compensate.

Compare that to the alternative: a Slack alert that says ⚠️ Manual intervention needed: charge_card, then twenty minutes of an engineer checking Stripe, cross-referencing CloudWatch timestamps, and manually confirming whether the refund actually went through. The structured summary turns a twenty-minute fire drill into a five-second triage.

When compensation itself fails — the Stripe API is down, the refund call times out — that shows up in the summary too:

  ✗ Failed:           charge_card → refund attempt failed
                       error: stripe_api_timeout
                       original charge: ch_xyz789, $49.00
                       action required: manual refund

The event store retains enough context to retry the failed compensation later, or to hand it to a human with all the information they need to resolve it manually.
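The reverse walk described in this section can be sketched in a few dozen lines. This is an illustration under stated assumptions, not Unwind's implementation: the CompletedStep and Summary shapes and the walkCompensation function are hypothetical, and the choice to keep walking after a failed undo is an assumption of this sketch.

```typescript
// Hypothetical shapes for illustration; not Unwind's real event schema.
type EffectClass = "idempotent" | "reversible" | "append-only" | "destructive";

interface CompletedStep {
  step: number;
  tool: string;
  effectClass: EffectClass;
  compensate?: () => Promise<void>; // expected only on reversible steps
}

interface Summary {
  compensated: string[];
  uncompensatable: string[];
  skipped: string[];
  failed: { tool: string; error: string }[];
}

async function walkCompensation(steps: CompletedStep[]): Promise<Summary> {
  const summary: Summary = { compensated: [], uncompensatable: [], skipped: [], failed: [] };
  // Copy before sorting so the caller's array is not mutated.
  const newestFirst = [...steps].sort((a, b) => b.step - a.step);

  for (const s of newestFirst) {
    if (s.effectClass === "idempotent") {
      summary.skipped.push(s.tool); // nothing to undo
    } else if (s.effectClass === "reversible" && s.compensate) {
      try {
        await s.compensate();
        summary.compensated.push(s.tool);
      } catch (err) {
        // Assumption: a failed undo is recorded and the walk continues.
        summary.failed.push({ tool: s.tool, error: String(err) });
      }
    } else {
      // Append-only and destructive effects can only be reported.
      summary.uncompensatable.push(s.tool);
    }
  }
  return summary;
}
```

Sorting on the recorded step number, rather than trusting array order, mirrors the idea that ordering comes from what was persisted and not from whatever the failing process happened to hold in memory.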

“I can write this in 50 lines”

Maybe. For one workflow with three tools, a custom wrapper is fine. But the abstraction breaks down in practice along three axes.

Shared tools across workflows. Your chargeCard tool appears in expense approval, subscription billing, and refund processing. Each workflow has different compensation semantics for the surrounding steps, but the charge tool's own compensation logic is always the same. Without a shared definition, you're duplicating that refund logic — or worse, implementing it slightly differently in each place. With Unwind, the tool is defined once. Every workflow that uses it inherits the same compensation contract.

Compensation ordering. When a workflow has six completed steps and step 7 fails, compensation needs to run in reverse order. If step 4's compensation depends on the result of step 5's compensation (releasing a hold only after reversing the transfer), the ordering matters and the hand-rolled version gets subtle. Unwind manages the dependency chain from the event store — it knows what ran, in what order, and replays compensation in the correct reverse sequence.

Auditability. An ad hoc try/catch block doesn't produce a queryable record of what was compensated, what wasn't, and why. When compliance asks “what happened to the charge on account X at 3:47pm on Tuesday,” you need an answer better than grep-ing CloudWatch logs. Unwind writes every tool call and compensation outcome to an append-only event store — SQLite locally, pluggable for production — with structured fields you can query directly.
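As a rough picture of what "queryable" means for the auditability point, here is an in-memory sketch of an append-only log with one structured query. The ToolEvent fields and the AppendOnlyLog class are assumptions made for this example; Unwind's actual store schema and query interface may look quite different.

```typescript
// Hypothetical event shape; Unwind's actual store schema may differ.
interface ToolEvent {
  runId: string;
  step: number;
  tool: string;
  kind: "executed" | "compensated" | "compensation_failed";
  at: string; // ISO 8601 timestamp
  detail: Record<string, unknown>;
}

class AppendOnlyLog {
  private readonly events: ToolEvent[] = [];

  append(event: ToolEvent): void {
    // Store a frozen copy so recorded history cannot be edited afterwards.
    this.events.push(Object.freeze({ ...event }));
  }

  // Everything recorded for one tool in one run, in insertion order.
  history(runId: string, tool: string): ToolEvent[] {
    return this.events.filter((e) => e.runId === runId && e.tool === tool);
  }
}

const log = new AppendOnlyLog();
log.append({
  runId: "run_example",
  step: 1,
  tool: "charge_card",
  kind: "executed",
  at: "2026-04-14T15:47:02Z",
  detail: { chargeId: "ch_example" },
});
log.append({
  runId: "run_example",
  step: 1,
  tool: "charge_card",
  kind: "compensated",
  at: "2026-04-14T15:47:09Z",
  detail: { refundId: "re_example" },
});

// "What happened to the charge?" becomes a lookup instead of a log search.
const chargeHistory = log.history("run_example", "charge_card");
```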

The numbers tell the story even at small scale. A typical workflow with five tools and proper compensation runs 80–120 lines of bespoke cleanup — the try/catch block, the per-tool branching, the error handling for when cleanup itself fails, the logging. With Unwind, the compensation logic lives in the tool definition: 3–5 lines per tool, written once. A new workflow that reuses those tools adds zero compensation code. The cost of the hand-rolled approach scales linearly with the number of workflows. Unwind's scales with the number of unique tools.

How it fits your stack

Unwind wraps tool calls. It doesn't manage execution, scheduling, or state. You keep using whatever runs your agent today.

Raw Anthropic SDK

Convert tools and dispatch inside your message loop:

typescript

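import { UnwindClient } from "@unwind/core";
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const unwind = new UnwindClient({ store: "sqlite://unwind.db" });

// Register your tools (definitions from above)
const tools = [checkBalance, chargeCard, sendEmail];

// `runId` identifies this agent run; `messages` is your conversation so far
try {
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024, // required by the Messages API
    tools: unwind.anthropicTools(tools),
    messages,
  });

  // The assistant turn must be in the history before its tool results
  messages.push({ role: "assistant", content: response.content });

  let step = 0;
  const toolResults: Anthropic.ToolResultBlockParam[] = [];
  for (const block of response.content) {
    if (block.type !== "tool_use") continue;
    const result = await unwind.handleToolUse(runId, step++, block, tools);
    toolResults.push({
      type: "tool_result",
      tool_use_id: block.id,
      content: JSON.stringify(result),
    });
  }

  // Feed all results for this turn back as a single user message
  if (toolResults.length > 0) {
    messages.push({ role: "user", content: toolResults });
  }
} catch (err) {
  // On failure anywhere in the loop:
  const summary = await unwind.compensate(runId);
  console.error("Run failed; compensation summary:", summary);
  throw err;
}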

unwind.anthropicTools() converts your tool definitions into the Anthropic tool schema format. unwind.handleToolUse() executes the tool, records the call and result in the event store, and returns the output. Together with unwind.compensate() on the failure path, that's the entire integration surface.

LangGraph

Plug Unwind into LangGraph's tool node. LangGraph keeps managing state and checkpoints; Unwind handles what happens when a tool's effect needs to be reversed.

typescript

import { StateGraph, START, END } from "@langchain/langgraph";
import { ToolMessage } from "@langchain/core/messages";
import { UnwindClient } from "@unwind/core";

const unwind = new UnwindClient({ store: "sqlite://unwind.db" });

// Wrap your existing LangGraph tool functions
const toolNode = async (state) => {
  // Handles the first tool call of the last message
  const toolCall = state.messages.at(-1).tool_calls[0];
  const unwindTool = unwind.resolve(toolCall.name);

  try {
    const result = await unwind.execute(
      state.runId, state.step, unwindTool, toolCall.args
    );
    return {
      step: state.step + 1, // return the update instead of mutating state
      messages: [new ToolMessage({
        content: JSON.stringify(result),
        tool_call_id: toolCall.id,
      })],
    };
  } catch (err) {
    // A thrown error aborts the run before any edge is evaluated, so record
    // the failure in state and let the conditional edge route to "failure"
    return { toolError: String(err) };
  }
};

// Assumes your schema defines `step` and `toolError` channels
const routeOnError = (state) => (state.toolError ? "error" : "success");

const graph = new StateGraph({ channels: schema })
  .addNode("agent", agentNode)
  .addNode("tools", toolNode)
  .addNode("failure", async (state) => {
    const summary = await unwind.compensate(state.runId);
    return { compensationSummary: summary };
  })
  .addEdge(START, "agent")
  .addEdge("agent", "tools")
  .addConditionalEdges("tools", routeOnError, {
    success: "agent",
    error: "failure",
  })
  .addEdge("failure", END);

LangGraph still handles checkpoints, branching, and re-entry. Unwind only touches the moment a tool's decision becomes a real-world side effect. If LangGraph resumes from a checkpoint, Unwind's event store knows which tools already ran and won't re-execute idempotent calls.

Temporal

Wrap tool execution inside Activities and call compensation from your workflow's failure handler:

typescript

// activities.ts
import { UnwindClient } from "@unwind/core";

export async function executeUnwindTool(
  runId: string, step: number, toolName: string, args: any
) {
  const unwind = new UnwindClient({ store: process.env.UNWIND_STORE_URL });
  const tool = unwind.resolve(toolName);
  return unwind.execute(runId, step, tool, args);
}

export async function compensateRun(runId: string) {
  const unwind = new UnwindClient({ store: process.env.UNWIND_STORE_URL });
  return unwind.compensate(runId);
}

// workflows.ts (workflow code runs in Temporal's sandbox, so it lives in
// its own file and imports only the activity types)
import { proxyActivities, workflowInfo } from "@temporalio/workflow";
import type * as activities from "./activities";

const { executeUnwindTool, compensateRun } = proxyActivities<typeof activities>({
  startToCloseTimeout: "30s",
  retry: { maximumAttempts: 3 },
});

export async function expenseApprovalWorkflow(input: WorkflowInput) {
  const runId = workflowInfo().workflowId;
  let step = 0;

  try {
    await executeUnwindTool(
      runId, step++, "check_balance", { account_id: input.accountId }
    );
    await executeUnwindTool(
      runId, step++, "charge_card", { amount: input.amount, source: input.source }
    );
    await executeUnwindTool(
      runId, step++, "send_email", { to: input.userEmail, subject: "Charged" }
    );
  } catch (err) {
    const summary = await compensateRun(runId);
    // Temporal records the summary in workflow history
    // Route to human task queue if summary contains failures
    throw err; // rethrow so Temporal marks the workflow as failed
  }
}

Temporal still manages retries, timeouts, and workflow durability. Unwind handles the semantic question Temporal doesn't answer: when this activity succeeded but the workflow failed later, what do we do about the side effect that activity created?

Where this gets hard

Some tools aren't one class permanently. A payment hold is reversible for 7 days, then it settles and becomes append-only. A message is reversible until the recipient reads it. Unwind currently treats effect class as static per tool — you pick one at definition time. If your tool's reversibility is time-bounded, the practical workaround today is to classify it as reversible and have your compensate function check the window and throw if it's expired (which surfaces in the summary as a failed compensation with a clear reason). Conditional and temporal effect classes are on the roadmap, but not yet shipped.
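The time-window workaround might look roughly like the sketch below. Everything in it is hypothetical: the releaseHoldIfOpen helper, the CompensationWindowExpired error, the Hold shape, and the injected release callback are illustration only, and the 7-day figure simply follows the payment-hold example above.

```typescript
// Hypothetical helper; names and the 7-day window are illustrative.
class CompensationWindowExpired extends Error {
  constructor(tool: string, expiredAt: Date) {
    super(`${tool}: reversal window expired at ${expiredAt.toISOString()}`);
    this.name = "CompensationWindowExpired";
  }
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

interface Hold {
  id: string;
  createdAt: string; // ISO 8601 timestamp of when the hold was placed
}

// Intended to be called from a reversible tool's compensate function.
async function releaseHoldIfOpen(
  hold: Hold,
  release: (id: string) => Promise<void>, // your real release call goes here
  now: Date = new Date(),
): Promise<void> {
  const expiresAt = new Date(Date.parse(hold.createdAt) + SEVEN_DAYS_MS);
  if (now.getTime() > expiresAt.getTime()) {
    // Throwing surfaces the step as a failed compensation with a clear reason.
    throw new CompensationWindowExpired("create_hold", expiresAt);
  }
  await release(hold.id);
}
```

Passing the current time in as a parameter keeps the window check testable without mocking the clock.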

Classification can also be ambiguous. Is a database upsert idempotent or reversible? Depends on whether the previous value matters. Is a Slack notification append-only or idempotent? Depends on whether a duplicate would confuse someone. Unwind forces you to pick, which is better than not thinking about it, but it doesn't resolve the judgment call for you. If you're unsure, classify conservatively: append-only is safer than idempotent, and destructive is safer than reversible.

Before you install

Try this. List every tool your agent can call. Next to each one, write one of the four effect classes. You'll know within ten minutes where your compensation gaps are — the reversible tools with no rollback logic, the append-only effects nobody's tracking, the destructive calls with no approval gate. That audit is valuable whether or not you use Unwind. If the gaps worry you, the library is there.
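If you'd rather run the audit as a script than on paper, a sketch like the following works without installing anything. The ToolAuditRow fields, the findGaps function, and the sample inventory are all hypothetical, invented for this example.

```typescript
type EffectClass = "idempotent" | "reversible" | "append-only" | "destructive";

// Hypothetical audit row: one entry per tool your agent can call.
interface ToolAuditRow {
  name: string;
  effectClass: EffectClass;
  hasCompensate?: boolean;   // is there rollback logic today?
  isTracked?: boolean;       // is the escaped effect recorded anywhere?
  hasApprovalGate?: boolean; // does a human approve before it runs?
}

function findGaps(rows: ToolAuditRow[]): string[] {
  const gaps: string[] = [];
  for (const row of rows) {
    if (row.effectClass === "reversible" && !row.hasCompensate) {
      gaps.push(`${row.name}: reversible but no rollback logic`);
    }
    if (row.effectClass === "append-only" && !row.isTracked) {
      gaps.push(`${row.name}: append-only but not tracked`);
    }
    if (row.effectClass === "destructive" && !row.hasApprovalGate) {
      gaps.push(`${row.name}: destructive but no approval gate`);
    }
  }
  return gaps;
}

// Example inventory; replace with your own tools.
const inventory: ToolAuditRow[] = [
  { name: "check_balance", effectClass: "idempotent" },
  { name: "charge_card", effectClass: "reversible", hasCompensate: true },
  { name: "create_hold", effectClass: "reversible" },
  { name: "send_email", effectClass: "append-only" },
  { name: "delete_account", effectClass: "destructive", hasApprovalGate: true },
];

const gaps = findGaps(inventory);
for (const gap of gaps) console.log(gap);
```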

Unwind is open source, MIT licensed, written in TypeScript.

bash

npm install @unwind/core

github.com/amoorching/unwind

I hope you found this post interesting. Feel free to reach out to me on Twitter/X or email me at akash[dot]moorching[at]gmail[dot]com to discuss any thoughts :)