What makes an agent "multi-step"
Multi-step agents are Claude-powered systems that chain two or more distinct tasks together: the output of each step feeds the next, and little or no human input is required between steps. The agent might search for information, process it, write a result, and send it, all in one run.
The key design challenge is not the individual steps. It is the handoffs between them. A bug in step two corrupts every step that follows. An unexpected output format in step three breaks the parser in step four. The checklist below targets these handoff points.
The pre-ship checklist
Work through this list in order. Each item represents a failure mode observed in real agent deployments.
Prompt isolation
- [ ] Each step has its own system prompt. No step inherits instructions from another.
- [ ] The system prompt for each step specifies exactly what format the output should use.
- [ ] You have tested each step independently with edge-case inputs before connecting them.
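One way to keep prompts isolated is to give every step its own small configuration object and build each request from that object alone. The sketch below assumes a hypothetical two-step pipeline; `StepConfig`, `STEPS`, `build_request`, and the prompt texts are illustrative names, not part of any SDK. The returned dictionary uses the `model`, `max_tokens`, `system`, and `messages` parameters of the Messages API, so it could be passed to `client.messages.create(**request)`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepConfig:
    name: str
    system_prompt: str  # owned by this step only, never shared
    max_tokens: int


# Hypothetical steps; the names and prompt texts are illustrative assumptions.
STEPS = {
    "extract": StepConfig(
        name="extract",
        system_prompt=(
            "Extract every company name from the user's text. "
            'Respond with only a JSON array of strings, for example ["Acme"].'
        ),
        max_tokens=512,
    ),
    "summarize": StepConfig(
        name="summarize",
        system_prompt=(
            "Summarize the user's text in exactly three bullet points. "
            "Respond with plain text, one bullet per line, each starting with '- '."
        ),
        max_tokens=1024,
    ),
}


def build_request(step: StepConfig, user_input: str, model: str = "claude-opus-4-5") -> dict:
    """Return keyword arguments for a single messages.create call."""
    return {
        "model": model,
        "max_tokens": step.max_tokens,
        "system": step.system_prompt,  # the step's own prompt, with its output format
        "messages": [{"role": "user", "content": user_input}],
    }
```

Because each request is built from exactly one `StepConfig`, a step can be tested on its own by calling `build_request` with edge-case inputs, with no state carried over from other steps.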
Output validation
- [ ] You validate the structure of each step's output before passing it to the next step.
- [ ] If a step returns unexpected output, the agent stops rather than continuing with bad data.
- [ ] You have a test case where step N produces malformed output, and you have verified that the agent halts cleanly.
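The validation items above can be sketched as a small gate between steps. This example assumes each step is expected to return a JSON object with known keys; that format is an assumption for illustration, and `StepOutputError`, `validate_step_output`, and `run_pipeline` are hypothetical names. Steps are plain callables here so the handoff logic can be tested without calling a model.

```python
import json
from typing import Callable


class StepOutputError(Exception):
    """Raised when a step's output does not match the expected structure."""


def validate_step_output(raw: str, step_name: str, required_keys: set[str]) -> dict:
    """Parse a step's raw output as a JSON object and check its keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise StepOutputError(f"Step '{step_name}' did not return valid JSON: {e}") from e
    if not isinstance(data, dict):
        raise StepOutputError(
            f"Step '{step_name}' returned {type(data).__name__}, expected a JSON object."
        )
    missing = required_keys - set(data)
    if missing:
        raise StepOutputError(f"Step '{step_name}' output is missing keys: {sorted(missing)}")
    return data


# One pipeline entry: (step name, function from input text to raw output, required keys).
Step = tuple[str, Callable[[str], str], set[str]]


def run_pipeline(steps: list[Step], initial_input: str) -> dict:
    """Run steps in order; the first invalid output raises and stops the run."""
    payload = initial_input
    data: dict = {}
    for name, step_fn, required_keys in steps:
        data = validate_step_output(step_fn(payload), name, required_keys)
        payload = json.dumps(data)  # only validated data crosses the handoff
    return data
```

To cover the last item, pass a fake step that returns something like `"Sure! Here is the JSON you asked for"` and assert that `run_pipeline` raises `StepOutputError` and that later steps are never called.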
Error handling
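    import anthropic

    # The client reads ANTHROPIC_API_KEY from the environment.
    client = anthropic.Anthropic()


    def run_agent_step(prompt: str, step_name: str) -> str:
        """Run one agent step; report the failing step and re-raise on error."""
        try:
            response = client.messages.create(
                model="claude-opus-4-5",
                max_tokens=2048,
                messages=[{"role": "user", "content": prompt}],
            )
            output = response.content[0].text
            # Treat empty or near-empty output as a failure, not a result.
            if not output or len(output.strip()) < 10:
                raise ValueError(f"Step '{step_name}' returned unexpectedly short output.")
            return output
        except Exception as e:
            # Name the step that failed, then re-raise so the run stops here.
            print(f"Agent halted at step '{step_name}': {e}")
            raise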
- [ ] Every step call is wrapped in error handling.
- [ ] Errors are logged with the step name, input length, and timestamp.
- [ ] The agent does not silently swallow errors and continue.
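The logging item above asks for more context than a bare `print` gives. As a minimal sketch, the wrapper below uses Python's standard `logging` module to record the step name and input length, with the timestamp supplied by the log format. `run_step_logged` and the `"agent"` logger name are hypothetical; `step_fn` stands in for any function that performs one step, such as a `functools.partial` around a model call.

```python
import logging
from typing import Callable

# %(asctime)s supplies the timestamp the checklist asks for.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("agent")


def run_step_logged(step_name: str, step_fn: Callable[[str], str], prompt: str) -> str:
    """Call one step; on failure, log context and re-raise so the run stops."""
    try:
        return step_fn(prompt)
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("step=%s input_chars=%d failed", step_name, len(prompt))
        raise
```

The bare `raise` matters: the wrapper records the failure but never converts it into a return value, so the caller cannot continue with missing data by accident.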
Cost controls
- [ ] You have set `max_tokens` on every step. There is no unbounded call.
- [ ] You have estimated the total token cost per agent run at average input size.
- [ ] If the agent runs on a schedule, you have calculated monthly cost at expected volume.
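The two estimates above are simple arithmetic, sketched here under stated assumptions. The token counts and the per-million-token prices are placeholders, not real pricing; substitute measured averages from your own runs and the current published rates for the model you use. The function names are hypothetical.

```python
def estimate_run_cost(
    steps: list[tuple[int, int]],
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Cost of one run, given (input_tokens, output_tokens) for each step."""
    total_in = sum(i for i, _ in steps)
    total_out = sum(o for _, o in steps)
    return (total_in * input_price_per_mtok + total_out * output_price_per_mtok) / 1_000_000


def estimate_monthly_cost(cost_per_run: float, runs_per_day: int, days: int = 30) -> float:
    """Cost of a scheduled agent over one month at a fixed daily volume."""
    return cost_per_run * runs_per_day * days


# Placeholder inputs: three steps with average token counts per step.
avg_steps = [(1500, 400), (900, 600), (1200, 300)]
# Placeholder prices in dollars per million tokens (assumed, not real rates).
cost = estimate_run_cost(avg_steps, input_price_per_mtok=5.0, output_price_per_mtok=25.0)
monthly = estimate_monthly_cost(cost, runs_per_day=200)
```

With these placeholder numbers, a run uses 3,600 input and 1,300 output tokens, which comes to about $0.05 per run and roughly $303 per month at 200 runs per day.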
Logging
- [ ] Every step logs: step name, model, input token count, output token count, latency.
- [ ] Logs are written to a persistent store, not just stdout.
- [ ] You can replay a failed run from logs without re-running the whole agent.
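A minimal sketch of a persistent, replayable step log follows, assuming one JSON record per line in a local file. `log_step`, `load_step_input`, and the record fields are hypothetical names. With the Anthropic Python SDK, the token counts would typically come from `response.usage.input_tokens` and `response.usage.output_tokens`, and latency from timing the call yourself.

```python
import json
import time
from pathlib import Path
from typing import Optional


def log_step(
    log_path: Path,
    *,
    run_id: str,
    step_name: str,
    model: str,
    step_input: str,
    step_output: Optional[str],  # None when the step failed
    input_tokens: int,
    output_tokens: int,
    latency_s: float,
) -> None:
    """Append one JSON record per step to a file that outlives the process."""
    record = {
        "logged_at": time.time(),
        "run_id": run_id,
        "step_name": step_name,
        "model": model,
        "step_input": step_input,
        "step_output": step_output,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_s": latency_s,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_step_input(log_path: Path, run_id: str, step_name: str) -> str:
    """Return the logged input of one step so it can be re-run in isolation."""
    with log_path.open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["run_id"] == run_id and record["step_name"] == step_name:
                return record["step_input"]
    raise KeyError(f"No record for run '{run_id}', step '{step_name}'")
```

Storing each step's input alongside its metrics is what makes replay possible: to debug a failure at step three, load that step's logged input and call the step directly, without repeating steps one and two.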
One step to take right now
Pick the agent you are closest to shipping and run it through the checklist. Mark every item you can honestly check off today. The gaps you find are your actual shipping blockers — not polish, not features. Fix those gaps, then ship.