Keeping Orchestration in Your Code, Not Your Prompt

We spent two days this week in Bedrock architecture workshops with AWS engineers — tool use, guardrails, embeddings, memory, multi-agent patterns. The biggest practical insight wasn't about a specific service. It was about where to put the coordination logic.

The easy version of tool use

When you first wire up an AI assistant with tool access, there's a compelling path of least resistance: give the model a set of tools, describe what they do, and let it figure out the sequencing. Call the search tool to find options, call the booking tool to select one, call the session tool if you need to track state. The model is genuinely good at this in isolation.

The problem is "in isolation." On a real booking platform, each service has its own validation rules, its own error modes, its own expectations about what came before. A fare search result has an expiry window. A quote operation needs specific metadata from the preceding search. Session identifiers need to flow through in a particular header. These aren't things you want the model deciding about on the fly — they're invariants the system depends on.
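To make "invariants" concrete, here is a minimal sketch of those three rules as plain checks in code. Everything in it is an assumption for illustration: the `FareResult` shape, the `search_token` field, and the `X-Session-Id` header name are hypothetical and do not come from any real service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


class PreconditionError(Exception):
    """Raised when a call would violate an invariant the platform depends on."""


@dataclass(frozen=True)
class FareResult:
    # Hypothetical shape of a search result; field names are illustrative.
    fare_id: str
    search_token: str      # metadata the quote step needs from the search
    expires_at: datetime   # end of the fare's validity window


def build_quote_request(fare: FareResult, session_id: str, now: datetime) -> dict:
    """Build a quote request only if every precondition holds."""
    if now >= fare.expires_at:
        raise PreconditionError(f"fare {fare.fare_id} expired; search again")
    if not fare.search_token:
        raise PreconditionError("quote requires the token from the preceding search")
    if not session_id:
        raise PreconditionError("session identifier is missing")
    return {
        # Header name is an assumption for illustration.
        "headers": {"X-Session-Id": session_id},
        "body": {"fare_id": fare.fare_id, "search_token": fare.search_token},
    }
```

A stale fare or a missing session identifier fails loudly at this boundary, whatever the model did or did not remember.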

Letting the model work out the sequence succeeds most of the time. But "most of the time" has a different operational meaning when you're coordinating services that touch financial transactions.

Orchestration belongs in your code

The position we landed on: orchestration logic lives in the application layer. The model's job is to understand user intent and produce structured output. Deciding which services to call, in what order, with what preconditions — that's engineering work that belongs in code, not in a system prompt hoping the model reasons its way to the right sequence.
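A rough sketch of that split, under stated assumptions: the model is asked to return JSON matching a hypothetical `BookingIntent`, and `search` and `quote` stand in for real service clients. None of these names come from our codebase or from Bedrock; they are only there to show where the sequencing decision sits.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BookingIntent:
    # Hypothetical structured output the model is asked to produce.
    origin: str
    destination: str
    date: str


def parse_intent(model_output: str) -> BookingIntent:
    """Validate the model's JSON output; malformed output raises here."""
    data = json.loads(model_output)
    return BookingIntent(
        origin=str(data["origin"]),
        destination=str(data["destination"]),
        date=str(data["date"]),
    )


def handle_booking(
    model_output: str,
    search: Callable[[BookingIntent], list[dict]],
    quote: Callable[[dict], dict],
) -> dict:
    """Fixed sequence: parse intent, search, then quote the first result.

    The model never chooses which service runs next; this function does.
    """
    intent = parse_intent(model_output)
    results = search(intent)
    if not results:
        return {"status": "no_results"}
    return {"status": "quoted", "quote": quote(results[0])}
```

The model's contribution ends at `parse_intent`. Everything after that line is ordinary control flow that can be read, reviewed, and changed like any other code.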

There are a few reasons this matters beyond the obvious.

Testability. Application-layer orchestration is just code. You can unit test it, integration test it, and reason about edge cases before they hit production. Prompt-driven orchestration is probabilistic: you can observe it and you can improve it, but you can't pin its behaviour down with a test suite in the same way.
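As an illustration of what "just code" buys you, here is a small sketch using Python's standard `unittest` module. The `run_booking_flow` orchestrator and the fake services are hypothetical stand-ins, not our real implementation; the point is only that call order and early exits become assertions.

```python
import unittest


def run_booking_flow(search, quote, book, query):
    """Toy orchestrator (hypothetical): search, quote the first hit, then book."""
    results = search(query)
    if not results:
        return None  # nothing to quote, so quote and book must not run
    quoted = quote(results[0])
    return book(quoted)


class BookingFlowTest(unittest.TestCase):
    def setUp(self):
        self.calls = []

    def _fake(self, name, result):
        # Records the call order and returns a canned result.
        def fn(arg):
            self.calls.append(name)
            return result
        return fn

    def test_services_called_in_order(self):
        outcome = run_booking_flow(
            self._fake("search", ["fare-1"]),
            self._fake("quote", "quote-1"),
            self._fake("book", "booking-1"),
            query="LHR to JFK",
        )
        self.assertEqual(outcome, "booking-1")
        self.assertEqual(self.calls, ["search", "quote", "book"])

    def test_empty_search_stops_the_flow(self):
        outcome = run_booking_flow(
            self._fake("search", []),
            self._fake("quote", "unused"),
            self._fake("book", "unused"),
            query="nowhere",
        )
        self.assertIsNone(outcome)
        self.assertEqual(self.calls, ["search"])
```

Saved as a module, this runs with `python -m unittest`. There is no equivalent assertion you can write against a sequence the model chooses at inference time.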

Debuggability. When something goes wrong with application-layer orchestration, you have a stack trace. When the model takes a wrong turn through your tools, you have a conversation log and a hypothesis. These are not equivalent debugging experiences.

Operational confidence. The people who run these systems — and who get paged when they break — need to be able to reason about what's happening. Keeping coordination logic in code means it's reviewable, auditable, and consistent. Engineers who didn't write it can still follow it.

None of this means the model is reduced to a dumb text processor. It's still doing the genuinely hard work: understanding intent, handling ambiguity, dealing with the mess of natural language, producing structured output in the format downstream services expect. That's where the intelligence should be applied. The sequencing is deterministic by design.

Model abstraction

There's a related decision that becomes expensive to get wrong: how you reference the model in your code.

If you name a specific model version in twelve places — config files, service constructors, test fixtures — you're setting yourself up for a refactoring exercise every time a better or cheaper model becomes available. On Bedrock, inference profiles let you abstract this properly: instead of hardcoding a specific model identifier, you reference a profile that maps to "the current model for this use case," and update the mapping in one place when things change.

This sounds like obvious engineering hygiene. It usually is. But it's the kind of thing that slips during a prototype phase and doesn't get fixed until someone is midway through a migration counting how many files they need to touch.

The deeper point is that model selection should be a configuration concern, not a code concern. The code should say what kind of inference it needs; what model actually handles that inference should be externally configurable.
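One way to sketch this at the application level, independent of any particular Bedrock feature: the code asks for a use case and a single lookup resolves it to an identifier. The use-case names, the placeholder identifiers, and the `MODEL_MAP_JSON` environment variable are all assumptions made up for this example.

```python
import json
import os

# Default mapping; identifiers are placeholders, not real model or profile IDs.
DEFAULT_MODEL_MAP = {
    "intent_extraction": "placeholder-profile-for-intent",
    "summarization": "placeholder-profile-for-summaries",
}


def resolve_model(use_case: str, env=None) -> str:
    """Return the configured model or profile identifier for a use case.

    MODEL_MAP_JSON is a hypothetical environment variable holding overrides,
    so a swap is a config change rather than a code change.
    """
    env = os.environ if env is None else env
    overrides = json.loads(env.get("MODEL_MAP_JSON", "{}"))
    mapping = {**DEFAULT_MODEL_MAP, **overrides}
    try:
        return mapping[use_case]
    except KeyError:
        raise KeyError(f"no model configured for use case {use_case!r}") from None
```

Call sites then read `resolve_model("intent_extraction")` and never mention a model version, so a migration touches one mapping instead of twelve files.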

Skills and memory over monolithic prompts

The other architectural shift worth recording: as agent capability grows, a single enormous system prompt becomes a liability.

It's hard to update — a change to one capability risks affecting another, and you can't always predict the interactions. It's hard to test in isolation. And it tends to accumulate over time, with each new requirement getting bolted on, until you have something that works but nobody fully understands.

The better direction is modular: discrete Skills for different capabilities, and explicit memory management instead of hoping the model retains relevant context from earlier in a long conversation. Each Skill can be reasoned about independently, tested in isolation, and updated without worrying about unintended side effects from capabilities that share the same prompt.
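A minimal sketch of the shape this can take, in plain Python and deliberately not tied to any SDK. The `Skill` and `Memory` classes, the two example skills, and the `run_skill` dispatcher are hypothetical names invented for illustration; a real skill handler would call the model with its own prompt where these use simple functions.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Memory:
    """Explicit conversation state, stored by the application."""
    facts: dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Skill:
    name: str
    prompt: str                              # instructions for this capability only
    handler: Callable[[str, Memory], str]    # plain function, testable alone


def remember_destination(text: str, memory: Memory) -> str:
    memory.facts["destination"] = text.strip()
    return f"Noted destination: {memory.facts['destination']}"


def recall_destination(text: str, memory: Memory) -> str:
    return memory.facts.get("destination", "No destination recorded yet")


SKILLS = {
    s.name: s
    for s in (
        Skill("set_destination", "Extract the destination city.", remember_destination),
        Skill("get_destination", "Report the stored destination.", recall_destination),
    )
}


def run_skill(name: str, text: str, memory: Memory) -> str:
    """Dispatch to one skill; other skills' prompts are never involved."""
    return SKILLS[name].handler(text, memory)
```

Because the destination lives in `Memory` and not in the conversation history, a later skill can read it no matter how long the conversation has run, and each handler can be tested with nothing more than a string and a `Memory` instance.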

This is where AWS's Strands SDK and AgentCore are pointing: an architecture where agent capabilities are composable rather than monolithic. It's closer to how you'd design a conventional service — small, cohesive pieces with clear interfaces — and further from the "give the model everything and see what happens" approach.

What changes

Time spent on architecture rather than shipping features doesn't always feel productive. The useful test is whether it changes how future work gets done.

In this case: when we extend the AI assistant — new capabilities, new service integrations — there's now a clearer pattern to follow. Orchestration in code. Model as intent-resolver and structured-output producer. Capabilities as discrete, testable Skills. Model selection behind an abstraction layer that makes swaps cheap.

None of that is especially dramatic. But getting these things right before the system is large is significantly easier than sorting them out after. Architecture decisions made at the beginning tend to have long half-lives, for better or worse.