Agentic Workflows Need State And Guardrails
Why useful LLM agents depend on tools, state machines, evals, and careful failure handling.
An LLM agent sounds simple: give the model tools, ask it to complete a task, and let it decide what to do next.
The demo version is magical. The production version is a distributed system with a probabilistic planner in the middle.
Useful agents need more than a prompt. They need state, tools, permissions, recovery paths, and evals.
A Tool Call Is Not An Agent
Tool calling lets a model invoke functions:
```json
{
  "name": "search_docs",
  "arguments": {
    "query": "API key rotation runbook"
  }
}
```

That is useful, but it is only one step. An agentic workflow usually requires a loop:
- Understand the task
- Choose a tool
- Read the result
- Decide the next step
- Stop when done
The model is not just generating an answer. It is controlling a process.
That process needs boundaries.
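
The loop above can be sketched in a few lines of Python. This is a minimal illustration rather than a real framework: `propose_next_action` is a stand-in for the model call, and `run_agent`, `TOOLS`, `max_steps`, and the `finish` pseudo-tool are all hypothetical names chosen for this example.

```python
from typing import Callable

# Hypothetical tool registry; a real system would call actual services.
TOOLS: dict[str, Callable[..., dict]] = {
    "search_docs": lambda query: {"status": "ok", "hits": [f"doc about {query}"]},
}

def propose_next_action(task: str, history: list[dict]) -> dict:
    """Stand-in for the model call: search once, then finish."""
    if not history:
        return {"name": "search_docs", "arguments": {"query": task}}
    answer = history[-1]["result"]["hits"][0]
    return {"name": "finish", "arguments": {"answer": answer}}

def run_agent(task: str, max_steps: int = 5) -> dict:
    history: list[dict] = []
    for _ in range(max_steps):  # the boundary: the loop cannot run forever
        action = propose_next_action(task, history)      # choose a tool
        if action["name"] == "finish":                   # stop when done
            return {"status": "done",
                    "answer": action["arguments"]["answer"],
                    "steps": len(history)}
        tool = TOOLS.get(action["name"])
        if tool is None:
            return {"status": "error",
                    "reason": f"unknown tool {action['name']}",
                    "steps": len(history)}
        result = tool(**action["arguments"])             # read the result
        history.append({"action": action, "result": result})
    return {"status": "step_limit_reached", "steps": len(history)}
```

Even in this toy version, the application owns the loop and the step limit. The model only proposes what happens inside it.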
State Is The Backbone
Without explicit state, an agent has to infer progress from the conversation history. That works for toy tasks and fails for long workflows.
A better design keeps state in application code:
```json
{
  "task_id": "case_1831",
  "goal": "Prepare refund decision",
  "stage": "collecting_order_details",
  "known_fields": {
    "order_id": "ORD-8841",
    "purchase_date": null
  },
  "completed_steps": ["identified_customer"],
  "blocked_on": ["purchase_date"]
}
```

The model sees the state, proposes the next action, and the application updates the state after validation.
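
One way to hold that state in application code is a small dataclass plus an update function that refuses anything it cannot validate. This is a sketch under assumptions: `WorkflowState` and `apply_field_update` are hypothetical names, and the validation rules (known field, non-empty value) are placeholders for real business checks.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class WorkflowState:
    task_id: str
    goal: str
    stage: str
    known_fields: dict[str, Optional[str]] = field(default_factory=dict)
    completed_steps: list[str] = field(default_factory=list)
    blocked_on: list[str] = field(default_factory=list)

def apply_field_update(state: WorkflowState, name: str, value: str) -> bool:
    """Apply a model-proposed field value only if it passes validation."""
    if name not in state.known_fields:
        return False  # the model may not invent new fields
    if not value.strip():
        return False  # placeholder check; real rules would be field-specific
    state.known_fields[name] = value
    if name in state.blocked_on:
        state.blocked_on.remove(name)  # the workflow is no longer blocked on this
    return True

# The same state as the JSON example above.
state = WorkflowState(
    task_id="case_1831",
    goal="Prepare refund decision",
    stage="collecting_order_details",
    known_fields={"order_id": "ORD-8841", "purchase_date": None},
    completed_steps=["identified_customer"],
    blocked_on=["purchase_date"],
)
```

The model never mutates `state` directly. It proposes a value, and `apply_field_update` decides whether the proposal becomes part of the record.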
Tools Need Contracts
Each tool should have a narrow purpose, clear arguments, and predictable output.
Bad tool:
```text
do_customer_stuff(input: string)
```

Good tools:

```text
lookup_order(order_id: string)
create_refund_case(order_id: string, reason: string)
send_customer_email(case_id: string, template_id: string)
```

Narrow tools are easier to validate and safer to expose. They also make the model’s choice more interpretable.
Tool outputs should be structured too. If a tool returns a paragraph, the model has to parse it. If it returns JSON, the workflow can inspect it.
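
A narrow tool with a structured result might look like the sketch below. The order ID pattern is an assumption inferred from the `ORD-8841` example, `_ORDERS` is made-up in-memory data standing in for an order service, and `ToolResult` is a hypothetical wrapper type.

```python
import re
from dataclasses import asdict, dataclass
from typing import Optional

# Assumed order ID format, inferred from the "ORD-8841" example.
ORDER_ID_PATTERN = re.compile(r"ORD-\d+")

# Hypothetical in-memory data standing in for an order service.
_ORDERS = {"ORD-8841": {"status": "delivered", "total": 42.00}}

@dataclass
class ToolResult:
    tool: str
    status: str                  # "ok", "invalid_arguments", or "not_found"
    data: Optional[dict] = None

def lookup_order(order_id: str) -> dict:
    """Narrow tool: one argument, one purpose, structured output."""
    if not ORDER_ID_PATTERN.fullmatch(order_id):
        return asdict(ToolResult("lookup_order", "invalid_arguments"))
    order = _ORDERS.get(order_id)
    if order is None:
        return asdict(ToolResult("lookup_order", "not_found"))
    return asdict(ToolResult("lookup_order", "ok", data=order))
```

Because every branch returns the same shape, the workflow can branch on `status` without asking the model to interpret free text.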
Permissions Should Be Tool-Specific
Agents become risky when every tool is available in every situation.
A better design uses an authorization matrix:
| Stage | Allowed tools |
|---|---|
| Identify customer | search_customer, lookup_order |
| Investigate issue | search_docs, read_ticket_history |
| Prepare action | draft_email, create_case |
| Execute action | send_email, issue_refund |
The model can only choose from tools allowed in the current stage. Sensitive tools require stricter preconditions.
This avoids a common failure: the agent jumps from partial information directly to a side effect because that action looked useful. The workflow should make illegal states unrepresentable.
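
The matrix translates almost directly into a lookup table. In this sketch the snake_case stage keys and the helper names `tools_for_stage` and `authorize` are assumptions; the tool names come from the table above.

```python
ALLOWED_TOOLS: dict[str, frozenset[str]] = {
    "identify_customer": frozenset({"search_customer", "lookup_order"}),
    "investigate_issue": frozenset({"search_docs", "read_ticket_history"}),
    "prepare_action": frozenset({"draft_email", "create_case"}),
    "execute_action": frozenset({"send_email", "issue_refund"}),
}

def tools_for_stage(stage: str) -> frozenset[str]:
    """Tools that may be offered to the model in this stage."""
    return ALLOWED_TOOLS.get(stage, frozenset())  # unknown stage: nothing allowed

def authorize(stage: str, tool_name: str) -> bool:
    """Re-check the model's choice instead of trusting the offered list."""
    return tool_name in tools_for_stage(stage)
```

Checking twice is deliberate: once when building the tool list for the model, and again when a tool call comes back, in case the model names a tool it was never offered.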
Planning Should Be Bounded
Open-ended planning is where agents get expensive and strange. The model can search forever, repeat actions, or pursue a bad subgoal with confidence.
Set limits:
- Maximum tool calls
- Maximum runtime
- Allowed tools per stage
- Required fields before side effects
- Stop conditions
- Human escalation rules
For example, an agent should not be allowed to issue a refund until the workflow state contains an order ID, refund eligibility, amount, and approval status.
```text
if missing(required_fields):
    ask_follow_up_or_escalate()
else:
    allow_refund_tool()
```

The model can help decide what information is missing. The application should decide whether an action is allowed.
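
A runnable version of that gate could look like this. The field names are assumptions based on the refund example (order ID, eligibility, amount, approval status), and the check covers presence only. A real gate would also verify the values, for instance that the approval status is actually "approved".

```python
# Assumed field names, based on the refund example in the text.
REQUIRED_FOR_REFUND = ("order_id", "refund_eligibility", "amount", "approval_status")

def missing_fields(known_fields: dict,
                   required: tuple[str, ...] = REQUIRED_FOR_REFUND) -> list[str]:
    """Required fields that are absent or still None, in declaration order."""
    return [name for name in required if known_fields.get(name) is None]

def next_allowed_step(known_fields: dict) -> dict:
    """The application, not the model, decides whether the refund tool unlocks."""
    missing = missing_fields(known_fields)
    if missing:
        return {"allow_refund_tool": False,
                "action": "ask_follow_up_or_escalate",
                "missing": missing}
    return {"allow_refund_tool": True, "action": "issue_refund", "missing": []}
```

Returning the list of missing fields gives the model something concrete to ask the user about, while the decision itself stays in code.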
Idempotency Saves You From Expensive Mistakes
Agents retry. Tools fail. Networks time out. The model may ask to perform the same action twice because it did not see the first result.
Any side-effecting tool should be designed with idempotency in mind.
```json
{
  "idempotency_key": "refund_case_1831_create",
  "order_id": "ORD-8841",
  "amount": 42.00
}
```

If the same action is attempted twice with the same key, the system should return the existing result instead of creating a duplicate refund, ticket, or email.
This is not an LLM-specific idea. It is standard distributed-systems hygiene. Agents just make it more urgent because the planner is probabilistic.
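
Here is a minimal in-memory sketch of that behavior. `RefundService` is a hypothetical stand-in for a real refund API. A production version would persist the keys, make the check-and-insert atomic, and probably reject a reused key that arrives with a different payload.

```python
class RefundService:
    """In-memory stand-in for a refund API that honors idempotency keys."""

    def __init__(self) -> None:
        self._results_by_key: dict[str, dict] = {}
        self.refunds_created = 0

    def create_refund(self, idempotency_key: str, order_id: str, amount: float) -> dict:
        existing = self._results_by_key.get(idempotency_key)
        if existing is not None:
            return existing  # replay: return the first result, create nothing new
        self.refunds_created += 1
        result = {
            "refund_id": f"refund_{self.refunds_created}",
            "order_id": order_id,
            "amount": amount,
        }
        self._results_by_key[idempotency_key] = result
        return result
```

If the agent retries after a timeout, the second call returns the original refund and the counter stays at one.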
Memory Is Not A Log
Agent memory is often presented as a magic feature. In practice, you need to separate memory types.
- Conversation history: what the user and assistant said
- Workflow state: current task progress
- Long-term user facts: preferences or account details
- Audit log: what actions were taken and why
- Retrieved knowledge: external docs used for reasoning
Mixing these into one blob creates privacy, reliability, and debugging problems.
The audit log is especially important. If an agent sends an email, changes a record, or triggers a workflow, you need to know which model output caused it and which checks passed.
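
One way to keep these memory types apart is to give each its own field and be explicit about what reaches the model. The sketch below is illustrative: `AgentMemory`, `record_action`, and `model_context` are hypothetical names, and excluding the audit log from the model's context is a design assumption, not a rule.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    conversation: list[dict] = field(default_factory=list)
    workflow_state: dict = field(default_factory=dict)
    user_facts: dict = field(default_factory=dict)
    audit_log: list[dict] = field(default_factory=list)
    retrieved_docs: list[str] = field(default_factory=list)

    def record_action(self, action: str, model_output_id: str,
                      checks_passed: list[str]) -> None:
        """Append an audit entry linking an action to the output and checks behind it."""
        self.audit_log.append({
            "action": action,
            "model_output_id": model_output_id,
            "checks_passed": list(checks_passed),
        })

    def model_context(self, fact_names: tuple[str, ...] = ()) -> dict:
        """What the model sees: no audit log, and only the user facts asked for."""
        return {
            "conversation": list(self.conversation),
            "workflow_state": dict(self.workflow_state),
            "retrieved_docs": list(self.retrieved_docs),
            "user_facts": {k: v for k, v in self.user_facts.items() if k in fact_names},
        }
```

With separate fields, a privacy rule such as "only pass the facts this step needs" becomes a filter in one function instead of a prompt instruction.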
Error Handling Is Part Of The Prompt Contract
Tools fail in ordinary ways:
- API timeout
- Permission denied
- Missing record
- Ambiguous search result
- Validation error
The model needs structured error information, not a stack trace pasted into the chat.
```json
{
  "tool": "lookup_order",
  "status": "not_found",
  "message": "No order matched ORD-8841",
  "recoverable": true
}
```

Then the agent can choose a recovery path: ask the user to confirm the order ID, search by email, or escalate.
Without structured errors, agents often respond with vague apologies or repeat the same failing tool call.
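
A thin wrapper around tool calls can produce that shape consistently. In this sketch the mapping from Python exception types to `status` and `recoverable` values is an assumption, `call_tool` is a hypothetical helper, and the `lookup_order` stub always fails so the error path is visible.

```python
from typing import Callable

# Assumed mapping: exception type -> (status, recoverable).
ERROR_MAP: list[tuple[type, str, bool]] = [
    (TimeoutError, "timeout", True),
    (PermissionError, "permission_denied", False),
    (LookupError, "not_found", True),
    (ValueError, "validation_error", True),
]

def call_tool(tool_name: str, fn: Callable[..., dict], **arguments) -> dict:
    """Run a tool and convert failures into structured errors, never stack traces."""
    try:
        return {"tool": tool_name, "status": "ok", "result": fn(**arguments)}
    except Exception as exc:
        for exc_type, status, recoverable in ERROR_MAP:
            if isinstance(exc, exc_type):
                return {"tool": tool_name, "status": status,
                        "message": str(exc), "recoverable": recoverable}
        # Unknown failure: keep internal details out of the model's context.
        return {"tool": tool_name, "status": "internal_error",
                "message": "Unexpected tool failure", "recoverable": False}

def lookup_order(order_id: str) -> dict:
    """Stub that always fails, to demonstrate the not_found path."""
    raise LookupError(f"No order matched {order_id}")
```

The `recoverable` flag lets the workflow, not just the model, decide whether another attempt is worth making.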
Human Review Is A Feature
Some tasks should not be fully automated. The agent should prepare work for a human rather than complete it alone.
Good human-in-the-loop patterns:
- Draft an email, but require approval to send
- Recommend a refund, but require review above a threshold
- Summarize a contract risk, but cite source clauses
- Prepare a database migration plan, but require engineer approval
This is not a failure of automation. It is how you safely apply automation to higher-value work.
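
The refund-threshold pattern can be expressed as a small routing function. The threshold value and the names `REVIEW_THRESHOLD` and `route_refund` are made up for illustration. The text only says that refunds above some threshold should be reviewed.

```python
from typing import Optional

REVIEW_THRESHOLD = 100.00  # hypothetical policy value

def route_refund(amount: float, approved_by: Optional[str] = None) -> dict:
    """Decide whether a recommended refund may execute or must wait for a human."""
    if amount <= REVIEW_THRESHOLD:
        return {"decision": "auto_execute", "amount": amount}
    if approved_by is None:
        # The agent's work is preserved as a recommendation awaiting review.
        return {"decision": "needs_review", "amount": amount}
    return {"decision": "execute_with_approval",
            "amount": amount, "approved_by": approved_by}
```

The agent still does the preparation in every case. The function only controls who gets to press the final button.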
Design For Partial Success
Many useful agent workflows should not be all-or-nothing.
If the agent cannot complete the task, it can still leave the user or human reviewer in a better place:
- Gather the relevant documents
- Fill known fields
- Identify missing information
- Draft a recommended next step
- Explain why it stopped
This is especially important for business workflows. A fully automated success rate of 70% may sound mediocre, but if the remaining 30% arrive neatly packaged for review, the system can still save a lot of time.
Partial success requires state. The system needs to know what was completed, what failed, and what remains blocked.
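
Given explicit state, a handoff package for a reviewer is mostly a matter of reshaping it. This sketch assumes the state is a plain dict with the same keys as the earlier JSON example. `build_handoff` and the `stop_reason` text are hypothetical.

```python
def build_handoff(state: dict, stop_reason: str) -> dict:
    """Summarize what was completed, what is known, and what is still missing."""
    fields = state["known_fields"]
    return {
        "task_id": state["task_id"],
        "goal": state["goal"],
        "completed_steps": list(state["completed_steps"]),
        "known_fields": {k: v for k, v in fields.items() if v is not None},
        "missing_fields": [k for k, v in fields.items() if v is None],
        "blocked_on": list(state["blocked_on"]),
        "stop_reason": stop_reason,  # explain why the agent stopped
    }

state = {
    "task_id": "case_1831",
    "goal": "Prepare refund decision",
    "stage": "collecting_order_details",
    "known_fields": {"order_id": "ORD-8841", "purchase_date": None},
    "completed_steps": ["identified_customer"],
    "blocked_on": ["purchase_date"],
}
handoff = build_handoff(state, "Customer did not provide a purchase date")
```

The reviewer receives the known fields, the gaps, and the reason for stopping in one object, rather than having to reread the conversation.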
Evaluating Agents
Agent evals need to measure the whole trajectory, not just the final answer.
Questions to ask:
- Did the agent choose the right tools?
- Did it avoid unnecessary tool calls?
- Did it stop at the right time?
- Did it preserve required fields?
- Did it recover from tool errors?
- Did it avoid forbidden actions?
- Was the final result correct?
For many workflows, a trajectory with a correct final answer can still be bad if it called the wrong tool, leaked data, or took an unsafe action along the way.
Store traces. Review traces. Turn failures into evals.
It is also useful to score agents at multiple levels:
- Step score: was each tool call valid?
- Trajectory score: was the sequence efficient and safe?
- Outcome score: did the final task succeed?
- User score: was the interaction understandable?
A workflow can have a good outcome and a bad trajectory. For example, an agent might eventually solve the issue after six unnecessary searches and one forbidden lookup attempt. That should not pass silently.
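
A simple scorer makes that distinction explicit. The sketch below covers the step, trajectory, and outcome levels, and omits the user score because it needs human or model judgment. The trace format, the `max_calls` budget, and the tool name `read_payment_details` are assumptions for illustration.

```python
def score_trajectory(trace: list[dict], allowed_tools: set[str],
                     max_calls: int, outcome_correct: bool) -> dict:
    """Score one run at the step, trajectory, and outcome levels."""
    calls = [step["tool"] for step in trace]
    forbidden = [tool for tool in calls if tool not in allowed_tools]
    # Step score: fraction of tool calls that were permitted.
    step_score = 1.0 if not calls else (len(calls) - len(forbidden)) / len(calls)
    # Trajectory: no forbidden calls and within the call budget.
    trajectory_ok = not forbidden and len(calls) <= max_calls
    return {
        "step_score": step_score,
        "forbidden_calls": forbidden,
        "trajectory_ok": trajectory_ok,
        "outcome_ok": outcome_correct,
        "passed": trajectory_ok and outcome_correct,
    }

# One needed search, six unnecessary ones, and one forbidden lookup.
trace = [{"tool": "search_docs"}] * 7 + [{"tool": "read_payment_details"}]
report = score_trajectory(trace, allowed_tools={"search_docs"},
                          max_calls=3, outcome_correct=True)
```

Here the outcome is correct, yet `passed` is false, which is exactly the case that a final-answer-only eval would miss.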
A Practical Architecture
A reliable agent architecture is usually less glamorous than the demo:
- A state machine owns workflow stages
- The model proposes the next action
- The application validates tool arguments
- Tools execute with permissions and logging
- The state machine updates progress
- Evals test common and adversarial paths
The model is powerful because it handles ambiguity inside the loop. The system is reliable because the loop is not fully owned by the model.
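
The pieces above can be combined into a single step function, sketched here under heavy simplification. The two-stage `STATE_MACHINE`, the stubbed tools, and the `step` function are hypothetical. Argument validation uses `inspect.signature(...).bind`, which checks argument names only, not types or values.

```python
import inspect
from typing import Callable

# Hypothetical workflow: stage -> {allowed tool: next stage}.
STATE_MACHINE: dict[str, dict[str, str]] = {
    "identify_customer": {"lookup_order": "prepare_action"},
    "prepare_action": {"create_case": "done"},
}

def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "delivered"}  # stubbed tool

def create_case(order_id: str, reason: str) -> dict:
    return {"case_id": f"case_for_{order_id}", "reason": reason}  # stubbed tool

TOOLS: dict[str, Callable[..., dict]] = {
    "lookup_order": lookup_order,
    "create_case": create_case,
}

def step(state: dict, proposal: dict, log: list) -> dict:
    """Validate and execute one model proposal, then advance the stage."""
    stage = state["stage"]
    allowed = STATE_MACHINE.get(stage, {})
    tool_name = proposal.get("name")
    arguments = proposal.get("arguments", {})
    if tool_name not in allowed:  # permission check owned by the state machine
        log.append({"stage": stage, "tool": tool_name, "outcome": "rejected"})
        return state
    tool = TOOLS[tool_name]
    try:
        inspect.signature(tool).bind(**arguments)  # argument names must match
    except TypeError:
        log.append({"stage": stage, "tool": tool_name, "outcome": "invalid_arguments"})
        return state
    result = tool(**arguments)
    log.append({"stage": stage, "tool": tool_name, "outcome": "executed"})
    return {**state, "stage": allowed[tool_name], "last_result": result}
```

A rejected or malformed proposal leaves the state untouched, so the model can try again without the workflow drifting.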
Observability Is Non-Negotiable
Agent traces are product telemetry and debugging evidence.
Log:
- State before each model call
- Tools made available
- Tool the model selected
- Raw arguments
- Validation result
- Tool output
- State transition
- Final user-facing message
Without this trace, you cannot tell whether a failure came from the model’s plan, the tool result, the state machine, or the prompt. With it, you can turn a bad run into a targeted fix.
The trace also supports review. If an agent performs work on behalf of a user, the user should be able to see what happened in plain language.
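
The logged fields map naturally onto one record per step, written as JSON lines. This is a sketch: `TraceEvent`, `write_event`, and `summarize` are hypothetical names, an in-memory `io.StringIO` stands in for a real log sink, and the example values are invented.

```python
import io
import json
from dataclasses import asdict, dataclass

@dataclass
class TraceEvent:
    state_before: dict
    tools_available: list[str]
    tool_selected: str
    raw_arguments: dict
    validation_result: str
    tool_output: dict
    state_after: dict
    user_message: str = ""

def write_event(event: TraceEvent, sink) -> None:
    """Append one trace event to the sink as a single JSON line."""
    sink.write(json.dumps(asdict(event), sort_keys=True) + "\n")

def summarize(event: TraceEvent) -> str:
    """Plain-language line a user or reviewer can read."""
    before = event.state_before.get("stage", "unknown")
    after = event.state_after.get("stage", "unknown")
    return f"Ran {event.tool_selected} ({event.validation_result}); stage: {before} -> {after}."

sink = io.StringIO()  # stand-in for a file or log pipeline
event = TraceEvent(
    state_before={"stage": "identify_customer"},
    tools_available=["search_customer", "lookup_order"],
    tool_selected="lookup_order",
    raw_arguments={"order_id": "ORD-8841"},
    validation_result="valid",
    tool_output={"status": "ok"},
    state_after={"stage": "investigate_issue"},
)
write_event(event, sink)
```

The same record serves both audiences: the JSON line feeds debugging and evals, and `summarize` produces the plain-language view.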
The Takeaway
Agentic workflows became useful when builders stopped treating agents as free-floating chatbots and started treating them as controlled processes.
The model can interpret, plan, and adapt. The application should own state, permissions, validation, and auditability.
That is the difference between an agent demo and an agent you can trust with real work.
The practical lesson is boring in the best way: useful agents look less like autonomous magic and more like well-instrumented workflow engines with a language model inside.