Agentic Workflows Need State And Guardrails • Rutvik Acharya

An LLM agent sounds simple: give the model tools, ask it to complete a task, and let it decide what to do next.

The demo version is magical. The production version is a distributed system with a probabilistic planner in the middle.

Useful agents need more than a prompt. They need state, tools, permissions, recovery paths, and evals.

A Tool Call Is Not An Agent

Tool calling lets a model invoke functions:

{
  "name": "search_docs",
  "arguments": {
    "query": "API key rotation runbook"
  }
}

That is useful, but it is only one step. An agentic workflow usually requires a loop:

Understand the task
Choose a tool
Read the result
Decide the next step
Stop when done

The model is not just generating an answer. It is controlling a process.

That process needs boundaries.
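
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `propose_action` is a hypothetical stand-in for the model call, `TOOLS` is a hypothetical registry with one stubbed tool, and the step cap is just one example of a boundary.

```python
from typing import Callable

# Hypothetical tool registry; a real system would call actual services.
TOOLS: dict[str, Callable[[dict], dict]] = {
    "search_docs": lambda args: {"status": "ok", "hits": ["runbook-12"]},
}

def propose_action(task: str, history: list[dict]) -> dict:
    """Stand-in for the model call: search once, then stop."""
    if not history:
        return {"name": "search_docs", "arguments": {"query": task}}
    return {"name": "stop", "arguments": {}}

def run_agent(task: str, max_steps: int = 5) -> list[dict]:
    history: list[dict] = []
    for _ in range(max_steps):                   # hard bound on the loop
        action = propose_action(task, history)   # choose a tool
        if action["name"] == "stop":             # stop when done
            break
        tool = TOOLS.get(action["name"])
        if tool is None:                         # unknown tool: record an error, keep going
            result = {"status": "error", "message": "unknown tool"}
        else:
            result = tool(action["arguments"])
        history.append({"action": action, "result": result})  # read the result
    return history
```

Even in this toy version, the application owns the loop and the model only proposes the next action.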

State Is The Backbone

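Without explicit state, an agent has to infer progress from the conversation history. That works for toy tasks but breaks down in long workflows, where the history grows, gets truncated or summarized, and no longer says unambiguously which steps are done.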

A better design keeps state in application code:

{
  "task_id": "case_1831",
  "goal": "Prepare refund decision",
  "stage": "collecting_order_details",
  "known_fields": {
    "order_id": "ORD-8841",
    "purchase_date": null
  },
  "completed_steps": ["identified_customer"],
  "blocked_on": ["purchase_date"]
}

The model sees the state, proposes the next action, and the application updates the state after validation.
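
One way to hold that state in application code is a small dataclass plus an update function that accepts a model-proposed value only after a check. The sketch below mirrors the JSON fields above; `apply_field_update` and its validation rule are illustrative assumptions, not a prescribed API.

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowState:
    task_id: str
    goal: str
    stage: str
    known_fields: dict = field(default_factory=dict)
    completed_steps: list = field(default_factory=list)
    blocked_on: list = field(default_factory=list)

def apply_field_update(state: WorkflowState, name: str, value) -> bool:
    """Accept a proposed value only if the field is expected and the value is non-empty."""
    if name not in state.known_fields or value is None:
        return False                      # reject unknown fields and empty values
    state.known_fields[name] = value
    if name in state.blocked_on:
        state.blocked_on.remove(name)     # the workflow no longer waits on this field
    return True

state = WorkflowState(
    task_id="case_1831",
    goal="Prepare refund decision",
    stage="collecting_order_details",
    known_fields={"order_id": "ORD-8841", "purchase_date": None},
    completed_steps=["identified_customer"],
    blocked_on=["purchase_date"],
)
```

The model never mutates `state` directly; it proposes a value and the application decides whether to record it.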

Tools Need Contracts

Each tool should have a narrow purpose, clear arguments, and predictable output.

Bad tool:

do_customer_stuff(input: string)

Good tools:

lookup_order(order_id: string)
create_refund_case(order_id: string, reason: string)
send_customer_email(case_id: string, template_id: string)

Narrow tools are easier to validate and safer to expose. They also make the model’s choice more interpretable.

Tool outputs should be structured too. If a tool returns a paragraph, the model has to parse it. If it returns JSON, the workflow can inspect it.
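
As a rough sketch of what a contract can look like for `lookup_order`, the code below validates arguments before the call and returns a structured result. The order ID pattern is an assumption inferred from the example ID `ORD-8841`, and the stubbed return fields are hypothetical.

```python
import re

# Assumed format, inferred from the example ID "ORD-8841".
ORDER_ID_PATTERN = re.compile(r"^ORD-\d+$")

def validate_lookup_order_args(arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the arguments are acceptable."""
    problems = []
    extra = set(arguments) - {"order_id"}
    if extra:
        problems.append(f"unexpected arguments: {sorted(extra)}")
    order_id = arguments.get("order_id")
    if not isinstance(order_id, str):
        problems.append("order_id must be a string")
    elif not ORDER_ID_PATTERN.match(order_id):
        problems.append("order_id does not match the expected format")
    return problems

def lookup_order(order_id: str) -> dict:
    """Stub tool: returns structured data the workflow can inspect, not prose."""
    return {"status": "ok", "order_id": order_id, "refund_eligible": True}
```

Because the tool is narrow, the validator stays short, and a rejected call can be reported back to the model as a list of concrete problems.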

Permissions Should Be Tool-Specific

Agents become risky when every tool is available in every situation.

A better design uses an authorization matrix:

| Stage | Allowed tools |
| --- | --- |
| Identify customer | `search_customer`, `lookup_order` |
| Investigate issue | `search_docs`, `read_ticket_history` |
| Prepare action | `draft_email`, `create_case` |
| Execute action | `send_email`, `issue_refund` |

The model can only choose from tools allowed in the current stage. Sensitive tools require stricter preconditions.

This avoids a common failure: the agent jumps from partial information directly to a side effect because that action looked useful. The workflow should make illegal states unrepresentable.
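
The matrix above translates almost directly into a lookup table. In this sketch the snake_case stage keys are an assumed naming convention; the tool names come from the table.

```python
ALLOWED_TOOLS = {
    "identify_customer": {"search_customer", "lookup_order"},
    "investigate_issue": {"search_docs", "read_ticket_history"},
    "prepare_action": {"draft_email", "create_case"},
    "execute_action": {"send_email", "issue_refund"},
}

def tools_for_stage(stage: str) -> set[str]:
    """Tools to expose to the model in this stage; unknown stages expose nothing."""
    return ALLOWED_TOOLS.get(stage, set())

def authorize(stage: str, tool_name: str) -> bool:
    """Re-check on the application side, even if the model was only shown allowed tools."""
    return tool_name in tools_for_stage(stage)
```

Checking twice is deliberate: `tools_for_stage` limits what the model is offered, and `authorize` guards execution in case the model names a tool it was never shown.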

Planning Should Be Bounded

Open-ended planning is where agents get expensive and strange. The model can search forever, repeat actions, or pursue a bad subgoal with confidence.

Set limits:

Maximum tool calls
Maximum runtime
Allowed tools per stage
Required fields before side effects
Stop conditions
Human escalation rules

For example, an agent should not be allowed to issue a refund until the workflow state contains an order ID, refund eligibility, amount, and approval status.

if missing(required_fields):
  ask_follow_up_or_escalate()
else:
  allow_refund_tool()

The model can help decide what information is missing. The application should decide whether an action is allowed.
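
A runnable version of that gate might look like the following. The field names are assumed spellings of the four requirements mentioned above, the tool-call budget of 10 is an arbitrary example, and the check covers only presence: whether the refund is actually eligible or approved is a separate check.

```python
# Assumed field names for the four requirements named in the text.
REFUND_REQUIRED_FIELDS = ("order_id", "refund_eligibility", "amount", "approval_status")

def missing_fields(known_fields: dict, required=REFUND_REQUIRED_FIELDS) -> list[str]:
    """Required fields that are absent or still None."""
    return [name for name in required if known_fields.get(name) is None]

def next_step(known_fields: dict, tool_calls_used: int, max_tool_calls: int = 10) -> str:
    if tool_calls_used >= max_tool_calls:   # budget exhausted: stop planning
        return "escalate_to_human"
    if missing_fields(known_fields):        # no side effects on partial information
        return "ask_follow_up_or_escalate"
    return "allow_refund_tool"
```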

Idempotency Saves You From Expensive Mistakes

Agents retry. Tools fail. Networks time out. The model may ask to perform the same action twice because it did not see the first result.

Any side-effecting tool should be designed with idempotency in mind.

{
  "idempotency_key": "refund_case_1831_create",
  "order_id": "ORD-8841",
  "amount": 42.00
}

If the same action is attempted twice with the same key, the system should return the existing result instead of creating a duplicate refund, ticket, or email.

This is not an LLM-specific idea. It is standard distributed-systems hygiene. Agents just make it more urgent because the planner is probabilistic.

Memory Is Not A Log

Agent memory is often presented as a magic feature. In practice, you need to separate memory types.

Conversation history: what the user and assistant said
Workflow state: current task progress
Long-term user facts: preferences or account details
Audit log: what actions were taken and why
Retrieved knowledge: external docs used for reasoning

Mixing these into one blob creates privacy, reliability, and debugging problems.

The audit log is especially important. If an agent sends an email, changes a record, or triggers a workflow, you need to know which model output caused it and which checks passed.

Error Handling Is Part Of The Prompt Contract

Tools fail in ordinary ways:

API timeout
Permission denied
Missing record
Ambiguous search result
Validation error

The model needs structured error information, not a stack trace pasted into the chat.

{
  "tool": "lookup_order",
  "status": "not_found",
  "message": "No order matched ORD-8841",
  "recoverable": true
}

Then the agent can choose a recovery path: ask the user to confirm the order ID, search by email, or escalate.

Without structured errors, agents often respond with vague apologies or repeat the same failing tool call.
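
One way to produce errors in that shape is a thin wrapper around every tool call. The mapping from Python exception types to status strings below is an assumption made for the sketch, and `failing_lookup_order` is a stub that always fails.

```python
def call_tool_safely(tool_name: str, tool, arguments: dict) -> dict:
    """Run a tool and convert common failures into structured results for the model."""
    try:
        return {"tool": tool_name, "status": "ok", "data": tool(**arguments)}
    except LookupError as exc:
        return {"tool": tool_name, "status": "not_found",
                "message": str(exc), "recoverable": True}
    except TimeoutError as exc:
        return {"tool": tool_name, "status": "timeout",
                "message": str(exc), "recoverable": True}
    except PermissionError as exc:
        return {"tool": tool_name, "status": "permission_denied",
                "message": str(exc), "recoverable": False}

def failing_lookup_order(order_id: str) -> dict:
    """Stub that simulates a missing record."""
    raise LookupError(f"No order matched {order_id}")
```

Calling `call_tool_safely("lookup_order", failing_lookup_order, {"order_id": "ORD-8841"})` yields the same structure as the JSON example above, with no stack trace reaching the model.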

Human Review Is A Feature

Some tasks should not be fully automated. The agent should prepare work for a human rather than complete it alone.

Good human-in-the-loop patterns:

Draft an email, but require approval to send
Recommend a refund, but require review above a threshold
Summarize a contract risk, but cite source clauses
Prepare a database migration plan, but require engineer approval

This is not a failure of automation. It is how you safely apply automation to higher-value work.
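
The refund pattern from the list above, sketched as a routing function. The threshold value and the `approved_by` parameter are hypothetical; a real policy would come from the business, not from the code.

```python
from typing import Optional

REVIEW_THRESHOLD_CENTS = 5000  # hypothetical policy value

def route_refund(amount_cents: int, approved_by: Optional[str] = None) -> str:
    """Decide whether a recommended refund can execute or must wait for a reviewer."""
    if amount_cents <= REVIEW_THRESHOLD_CENTS:
        return "execute"                  # small refunds proceed automatically
    if approved_by:
        return "execute"                  # large refunds proceed once a human signs off
    return "queue_for_review"
```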

Design For Partial Success

Many useful agent workflows should not be all-or-nothing.

If the agent cannot complete the task, it can still leave the user or human reviewer in a better place:

Gather the relevant documents
Fill known fields
Identify missing information
Draft a recommended next step
Explain why it stopped

This is especially important for business workflows. A fully automated success rate of 70% may sound mediocre, but if the remaining 30% arrive neatly packaged for review, the system can still save a lot of time.

Partial success requires state. The system needs to know what was completed, what failed, and what remains blocked.
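
With explicit state, a handoff package is mostly a projection of what the workflow already knows. A small sketch, assuming state is a dictionary with the same keys as the earlier JSON example and that `build_handoff` is a helper you would write yourself:

```python
def build_handoff(state: dict, stop_reason: str) -> dict:
    """Summarize an unfinished task for a human reviewer."""
    fields = state["known_fields"]
    return {
        "task_id": state["task_id"],
        "completed_steps": list(state["completed_steps"]),
        "known_fields": {k: v for k, v in fields.items() if v is not None},
        "missing_fields": [k for k, v in fields.items() if v is None],
        "stop_reason": stop_reason,       # explain why the agent stopped
    }
```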

Evaluating Agents

Agent evals need to measure the whole trajectory, not just the final answer.

Questions to ask:

Did the agent choose the right tools?
Did it avoid unnecessary tool calls?
Did it stop at the right time?
Did it preserve required fields?
Did it recover from tool errors?
Did it avoid forbidden actions?
Was the final result correct?

For many workflows, a trajectory with a correct final answer can still be bad if it called the wrong tool, leaked data, or took an unsafe action along the way.

Store traces. Review traces. Turn failures into evals.

It is also useful to score agents at multiple levels:

Step score: was each tool call valid?
Trajectory score: was the sequence efficient and safe?
Outcome score: did the final task succeed?
User score: was the interaction understandable?

A workflow can have a good outcome and a bad trajectory. For example, an agent might eventually solve the issue after six unnecessary searches and one forbidden lookup attempt. That should not pass silently.
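
A rough sketch of scoring a recorded trace at three of those levels. The trace format (a list of steps with `tool` and `valid` keys) and the pass criteria are assumptions for illustration; the user score is omitted because it usually depends on human feedback.

```python
def score_trajectory(steps: list[dict], forbidden_tools: set[str],
                     max_calls: int, task_succeeded: bool) -> dict:
    """Score one recorded run at the step, trajectory, and outcome levels."""
    valid_count = sum(1 for s in steps if s["valid"])
    used_forbidden = any(s["tool"] in forbidden_tools for s in steps)
    return {
        # Fraction of tool calls that passed validation.
        "step_score": valid_count / len(steps) if steps else 1.0,
        # Safe and within budget: no forbidden attempts, not too many calls.
        "trajectory_ok": (not used_forbidden) and len(steps) <= max_calls,
        "outcome_ok": task_succeeded,
    }
```

A run that succeeds but includes a forbidden lookup attempt comes back with `outcome_ok` true and `trajectory_ok` false, which is exactly the case that should not pass silently.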

A Practical Architecture

A reliable agent architecture is usually less glamorous than the demo:

A state machine owns workflow stages
The model proposes the next action
The application validates tool arguments
Tools execute with permissions and logging
The state machine updates progress
Evals test common and adversarial paths

The model is powerful because it handles ambiguity inside the loop. The system is reliable because the loop is not fully owned by the model.

Observability Is Non-Negotiable

Agent traces are product telemetry and debugging evidence.

Log:

State before each model call
Tools made available
Tool the model selected
Raw arguments
Validation result
Tool output
State transition
Final user-facing message

Without this trace, you cannot tell whether a failure came from the model’s plan, the tool result, the state machine, or the prompt. With it, you can turn a bad run into a targeted fix.

The trace also supports review. If an agent performs work on behalf of a user, the user should be able to see what happened in plain language.
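
The items in the log list above map naturally onto one structured record per model call. A minimal sketch using JSON Lines; the key names and the choice of one JSON object per line are assumptions, not a standard trace format.

```python
import json
import time

def trace_event(state_before: dict, tools_available: list, selected_tool: str,
                raw_arguments: dict, validation: dict, tool_output: dict,
                state_after: dict, user_message: str) -> dict:
    """Bundle everything needed to explain one step of an agent run."""
    return {
        "timestamp": time.time(),
        "state_before": state_before,
        "tools_available": tools_available,
        "selected_tool": selected_tool,
        "raw_arguments": raw_arguments,
        "validation": validation,
        "tool_output": tool_output,
        "state_after": state_after,
        "user_message": user_message,
    }

def format_trace_line(event: dict) -> str:
    """Serialize one event as a single JSON line, ready to append to a log."""
    return json.dumps(event, sort_keys=True) + "\n"
```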

The Takeaway

Agentic workflows became useful when builders stopped treating agents as free-floating chatbots and started treating them as controlled processes.

The model can interpret, plan, and adapt. The application should own state, permissions, validation, and auditability.

That is the difference between an agent demo and an agent you can trust with real work.

The practical lesson is boring in the best way: useful agents look less like autonomous magic and more like well-instrumented workflow engines with a language model inside.