A reasonable person, looking at what DeepPolicy does, will ask why the pipeline has ten stages instead of one. After all, large language models can now hold entire budgets in context, reason across them, and produce coherent summaries. Why bother with the orchestration?
It is a fair question, and the answer is more interesting than “because we like architecture diagrams.”
What a single large prompt actually produces
Take a 122-page national budget, paste it into a single context window, and ask a frontier model for a structured fiscal analysis. The output is, let’s be honest, impressive. It identifies major spending categories. It picks out a few notable line items. It produces a fluent narrative summary. If you are using it for a board briefing, it is plausibly enough.
Now run the same prompt twice and compare the outputs. The two runs agree on the broad strokes and disagree on roughly thirty percent of the specifics. They emphasise different programmes. They reach different aggregate figures. They cite different lines as evidence — sometimes for the same claim.
This is not because the model is bad. It is because the task is structurally underspecified. “Analyse this budget” is a brief that an experienced research director would never give to a single analyst, because they know what would come back: a competent summary that papers over the parts the analyst found hardest, with no audit trail back to the source.
What the experienced research director does instead is exactly what the multi-agent pipeline does: split the document, assign specialists, define output schemas, aggregate deterministically, and edit the result.
What changes when you specialise the agents
The shift from a single prompt to a pipeline of specialised agents is not primarily about the quality of any individual output. It is about consistency and verifiability.
When an analyst agent is briefed with one chapter, one analytical framework, and one structured output schema, three things become true at once.
First, the agent has no opportunity to silently drop the chapter it found difficult, because there are no other chapters in its context to switch to. It must produce a defined output for the chapter it was given.
Second, the structured output is comparable across chapters by construction. Chapter II’s output and Chapter VII’s output have the same schema, the same field names, the same classification taxonomy. Aggregation becomes a deterministic operation rather than an act of interpretation.
Third, the chain of attribution is preserved. Every figure in the chapter analysis is tied to a specific source row. When the editor agent compiles the master whitepaper, it inherits this attribution chain. When the user reads the final document, every claim traces back through the pipeline to the originating line.
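The three properties above follow from the shape of the per-chapter contract. As a minimal sketch of what such a contract might look like (the field names and row-reference format here are illustrative assumptions, not DeepPolicy's actual schema):

```python
from dataclasses import dataclass, field

# Hypothetical per-chapter output schema: every chapter analysis shares the
# same fields and taxonomy, and every figure keeps a pointer to its source row.
@dataclass
class LineItem:
    source_row: str   # e.g. "ch2:table4:row17" -- the attribution anchor
    category: str     # drawn from one shared classification taxonomy
    amount: float     # the figure as stated in the source

@dataclass
class ChapterAnalysis:
    chapter_id: str
    items: list[LineItem] = field(default_factory=list)
    narrative: str = ""

    def total(self) -> float:
        return sum(i.amount for i in self.items)

# Because Chapter II and Chapter VII share this schema, comparing or merging
# them is mechanical: same field names, same taxonomy, no interpretation.
ch2 = ChapterAnalysis("II", [LineItem("ch2:t1:r3", "health", 120.0)])
print(ch2.total())  # 120.0
```

The point of the sketch is the `source_row` field: it travels with the figure through every later stage, which is what makes the final attribution chain possible.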
A single large prompt cannot give you any of this. It can give you a fluent summary, but the relationship between the summary’s claims and the source document is — in a precise technical sense — unverifiable.
Where the deterministic layer lives
One of the most underrated design decisions in the pipeline is what the agents do not do.
Aggregation is not done by an agent. Once the per-chapter JSON outputs exist, summing them, cross-checking them against the source totals, and reconciling them is a hundred lines of Python. There is no upside to running this through a language model. There is significant downside: a model can produce a number that looks correct, has the right order of magnitude, and is wrong in the third significant digit, and you will not catch it without going line by line.
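That deterministic layer can be sketched in a few lines. The JSON keys and tolerance below are illustrative assumptions, not the pipeline's actual code, but they show the shape of the operation: sum the per-chapter outputs, then reconcile against the total the source document itself states.

```python
# Sketch of deterministic aggregation: sum per-chapter JSON outputs and
# reconcile the result against the source document's own stated total.
def aggregate(chapter_outputs: list[dict]) -> dict[str, float]:
    by_category: dict[str, float] = {}
    for chapter in chapter_outputs:
        for item in chapter["items"]:
            cat = item["category"]
            by_category[cat] = by_category.get(cat, 0.0) + item["amount"]
    return by_category

def reconcile(by_category: dict[str, float], stated_total: float,
              tol: float = 0.01) -> None:
    computed = sum(by_category.values())
    # Fail loudly rather than let a plausible-looking number through --
    # exactly the check a model cannot be trusted to perform reliably.
    if abs(computed - stated_total) > tol:
        raise ValueError(f"mismatch: computed {computed}, source states {stated_total}")

chapters = [
    {"items": [{"category": "health", "amount": 120.0},
               {"category": "education", "amount": 80.0}]},
    {"items": [{"category": "health", "amount": 30.0}]},
]
totals = aggregate(chapters)
reconcile(totals, stated_total=230.0)  # passes: 120 + 80 + 30 == 230
print(totals)  # {'health': 150.0, 'education': 80.0}
```

Nothing here requires judgment, which is precisely why it should not be delegated to a model: code either reconciles or raises, with no fluent in-between.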
So the pipeline draws the line explicitly. Models reason about classifications, write narrative, and produce structured outputs. Code does arithmetic, validates totals, and produces charts. Neither tries to do the other’s job. This is not a sophisticated insight, but it is one that almost every “put the document into the LLM” workflow gets wrong.
The cost of orchestration
The honest objection to the pipeline approach is that it is operationally heavier. You need to define schemas, manage parallelism, handle partial failures, decide what to do when the eleventh of forty agents produces malformed output. A single prompt, by contrast, just runs.
This is true, and it is the reason most generalist tools have not adopted the pipeline shape. For a generalist tool serving generalist users, the operational complexity is not worth it.
For a tool whose buyers will be reading the output in front of a board of trustees, defending each number on a panel, and citing the analysis in a public submission to a parliamentary committee, the calculus is different. The single prompt is faster to ship and impossible to defend. The pipeline is more work to build and produces something that survives scrutiny.
What this means in practice
When you look at the deliverables from a DeepPolicy run, the things that matter — the per-line citation index, the reconciled aggregates, the demographic briefs that all reference the same underlying figures — are not features that were added on top of an LLM workflow. They are direct consequences of the pipeline shape. They are not available as a configuration option in a single-prompt tool, no matter how large the model is.
This is, in the end, why the architecture looks the way it does. Not because we wanted ten stages, but because the deliverable that institutional buyers actually need cannot be produced any other way.