1. Premise
An agent given a task like "extract every pattern matching X from this corpus," "modify all files that import Y," or "flag every problematic statement in this paragraph" is being asked for an extensional answer — one that is correct only if it covers the full set. Large Language Models, as inference engines, are not built for extensionality. They are built for saliency: the most striking, the most representative, the most prompt-shaped few items. The gap between what we ask of agents and what the underlying model is structurally biased to deliver is the central pressure described here.
The thesis: any agent that wants to act extensionally must, at some point, leave the model and run code. Not because code is more elegant, but because the cost surface of the transformer makes extensionality economically irrational inside a single completion. This piece traces the mechanism (§2), the workaround the field has converged on (§3), the cheap escape hatch agents already use (§4), and a generalized operator language built atop the Structured Agency class system as the principled version of that escape hatch (§5).
2. Saliency vs. Extensionality
Three independent forces — preference-data length bias, attention dilution on long inputs, and the asymmetric economics of decode — push generation toward short, salience-weighted outputs, and underneath them sits a deeper architectural mismatch: enumeration is a set-construction task being performed by a sequence predictor. They compose. Each is enough on its own to discourage a model from enumerating a full set; together they make extensionality something the model will attempt when explicitly asked, but often fail at — silently truncating, hedging with "and others," fabricating plausible-shaped items to round out the list, or simply stopping early. The failure is not always visible to the agent that called it.
2.1 The Output-Length Bias
RLHF reward models systematically reward longer responses on aesthetic grounds [1], but the more interesting bias for our purposes is the opposite tail: in operator tasks (extract, list, classify, modify), models trained on instruction-tuning preference data tend toward terse, summary-shaped completions over enumerative ones. Length-controlled win-rate evaluations have shown that judges prefer verbosity when the answer is a single argument [1], but an explicit list of forty matches is a different output shape — one whose marginal items contribute little additional reward and substantial additional risk of error. The model's prior is to compress.
The practical consequence: ask "find all X in this document" and the typical generation produces a representative sample plus an ellipsis or a hedge ("among others," "for example"). The completion shape is the one rewarded during training; enumeration is not.
2.2 Attention Dilution on Long Inputs
As input length grows, attention is zero-sum across an increasing number of tokens, and the share allocated to any single evidence token shrinks. Recent work names this score dilution and treats it as a fundamental limitation of the architecture, distinct from positional bias [2]. The companion finding — that context length alone degrades performance even when retrieval is perfect [3] — rules out the convenient explanation that long-context failures are just bad retrieval.
The original Lost in the Middle paper [4] showed the U-shaped attention bias: tokens at the start and end of a long input dominate, and material in the middle is systematically underweighted. Found in the Middle [5] pinpointed this as an intrinsic positional bias in the attention distribution itself. The aggregate effect is that as a prompt grows, the model increasingly attends to salient regions — boundaries, repeated patterns, surface markers — and underweights uniform middle content. Extensional tasks, by definition, require uniform attention across the full input. Score dilution is the architectural reason that requirement is hard to meet.
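The zero-sum argument can be made concrete with a toy calculation. The sketch below is an illustration only, not the analysis in [2]: it assumes a single evidence token whose attention logit sits a fixed margin above every distractor, and the function name and the margin value are hypothetical.

```python
import math

def evidence_share(n_tokens: int, margin: float = 4.0) -> float:
    """Softmax weight on one evidence token among n_tokens.

    Assumption (illustrative): every distractor has logit 0 and the
    evidence token has logit `margin`.
    """
    evidence = math.exp(margin)
    distractors = n_tokens - 1  # each distractor contributes exp(0) = 1
    return evidence / (evidence + distractors)

for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> evidence share {evidence_share(n):.4f}")
```

With the margin held fixed, the share falls roughly as 1/n once the distractors outnumber the evidence term, which is the dilution effect in miniature.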
2.3 Decode and KV-Cache Economics
Even if the model wanted to enumerate the full extension, the cost surface punishes it. Inference splits into a compute-bound prefill phase (the input is processed in parallel) and a memory-bandwidth-bound decode phase (tokens are emitted one at a time, each step reading the entire KV cache to compute attention) [6]. Output tokens are 3–10× more expensive than input tokens at every major API [6][7] precisely because of this asymmetry: prefill amortizes across tokens, decode does not. Cached input is cheaper still — typically an order of magnitude — when the prefix can be reused [7].
This produces a sharp economic preference for short outputs over long ones and for repeated calls with cached prefixes over single long completions. An extensional generation — say, a thousand-line list of every match in a corpus — pays the maximum decode bill. The same task, decomposed into a fan-out of short calls or a single tool call that returns a structured result without decoding, pays a small fraction of it.
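A rough cost comparison makes the asymmetry tangible. The prices below are hypothetical placeholders chosen to sit inside the ranges quoted above (output at 5× input, cached input at one tenth of input); the token counts are likewise invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Prices:
    """Hypothetical dollars per million tokens (illustrative, not a real price list)."""
    input: float = 1.00
    cached_input: float = 0.10  # about an order of magnitude below fresh input
    output: float = 5.00        # inside the 3-10x range discussed above

def call_cost(p: Prices, fresh_in: int, cached_in: int, out: int) -> float:
    """Dollar cost of one call given its token counts."""
    total = fresh_in * p.input + cached_in * p.cached_input + out * p.output
    return total / 1_000_000

p = Prices()
# One completion that reads a 50k-token corpus and decodes a 40k-token list.
long_completion = call_cost(p, fresh_in=50_000, cached_in=0, out=40_000)
# One short tool call: the model decodes ~200 tokens; the tool produces the list.
tool_call = call_cost(p, fresh_in=2_000, cached_in=0, out=200)

print(f"long completion: ${long_completion:.4f}")
print(f"tool call:       ${tool_call:.4f}")
```

Under these assumed numbers the decoded list dominates the bill; the tool call pays for a short instruction and nothing for the enumeration itself.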
2.4 Enumeration vs. Sequence Prediction
Beneath the three pressures sits an architectural mismatch. Enumeration is a set-construction task — complete coverage, no duplicates, a principled stopping condition — but a decoder-only transformer is a sequence predictor. There is no native primitive for "members covered so far," "members still outstanding," or "set membership without duplication." Each of those properties has to be reconstructed at every decode step from the literal prefix that has already been emitted into the context window.
Two consequences follow. The dilution that hurts long inputs (§2.2) recurs on the output side as the enumeration grows: the longer the list emitted, the more attention is spent on the tail of the model's own generation rather than on the source material it is meant to be enumerating. And stopping is itself a token decision, governed by the same priors as every other token. The architecture has no internal completeness predicate that fires when "every member has been covered" — the end of the enumeration is whichever closing token wins the next decode under the brevity, attention, and cost pressures from §2.1–§2.3.
Neutralize the other three forces — train the model to prefer long outputs, fix attention dilution, make decode free — and the mismatch is still there. A sequence model can simulate set construction over short, well-bounded extensions, but the simulation degrades with cardinality, and the architecture provides no signal when it has.
2.5 The Resulting Failure Mode
Stack the four pressures and a coherent failure mode emerges. The model:
- Prefers short, summary-shaped outputs (training prior).
- Underweights middle-of-context evidence as inputs grow (attention dilution).
- Pays a steep marginal cost per emitted token (inference economics).
- Has no native set-construction primitive on which to ground completeness (architectural mismatch).
Asked for an extensional answer, it tries. It begins enumerating, but the same forces that discouraged the attempt now degrade it mid-completion: items in the middle of the input are underweighted, the marginal token gets more expensive as the context grows, the trained prior nudges toward wrap-up, and there is no internal predicate that knows when "all" has been reached. The result is a list that looks extensional — confidently delivered, plausibly shaped — but is silently incomplete or partially fabricated. The agent calling the model has no signal that anything was missed. This is the worst kind of failure for almost any production agent task that has the words all, every, or each in its specification.
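When a deterministic enumeration is available, the silent failure can be made visible by comparing the model's list against it. The sketch below assumes such a ground-truth set exists (for example, from a search tool); the function name and the sample data are hypothetical.

```python
def coverage_report(claimed: list[str], ground_truth: set[str]) -> dict[str, set[str]]:
    """Compare a model-produced list against a deterministic enumeration."""
    claimed_set = set(claimed)
    return {
        "missing": ground_truth - claimed_set,     # silently dropped members
        "fabricated": claimed_set - ground_truth,  # plausible-shaped inventions
        "duplicated": {x for x in claimed_set if claimed.count(x) > 1},
    }

# Illustrative data: four real matches, and a model answer that looks complete.
ground_truth = {"a.py:3", "b.py:10", "c.py:7", "d.py:1"}
claimed = ["a.py:3", "b.py:10", "b.py:10", "e.py:2"]

report = coverage_report(claimed, ground_truth)
print(report)
```

The check is only as good as the ground truth, which is the point of the sections that follow: the enumeration has to come from somewhere other than the completion.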
3. The Multi-Turn Workaround
3.1 The Refactor Example
Concrete case. An agent is asked: "refactor every function in this codebase that calls parseDate so it accepts a timezone argument." The current dominant pattern is a two-phase loop:
- Discover. Use grep/rg or a code-search tool to enumerate the call sites. The output is structured (a list of file:line matches) and cheap to produce, because the model never has to decode it.
- Operate. Iterate over the discovered list, opening each file in turn, generating the edit, and applying it. One file per turn, sometimes one function per turn.
The discovery step works because it is delegated to a deterministic tool. The model never tries to enumerate call sites from memory. The operate step works because each turn's input is small enough to keep attention concentrated on the file under edit. Extensionality is recovered, but only by paying for it in serial turns.
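The two phases can be sketched in a few lines. This is a minimal stand-in, not a real agent harness: discovery uses Python's re module in place of grep/rg, the *.js glob is an assumption about the codebase, and edit_one is a hypothetical placeholder for the per-turn model call.

```python
import re
import tempfile
from pathlib import Path

CALL = re.compile(r"\bparseDate\s*\(")

def discover(root: Path) -> list[tuple[Path, int]]:
    """Deterministically enumerate (file, line) call sites of parseDate."""
    sites = []
    for path in sorted(root.rglob("*.js")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if CALL.search(line):
                sites.append((path, lineno))
    return sites

def operate(sites, edit_one):
    """Visit every discovered site once; edit_one stands in for a model turn."""
    return [edit_one(path, lineno) for path, lineno in sites]

# Tiny throwaway codebase so the sketch runs end to end.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.js").write_text("const d = parseDate(raw);\nlog(d);\n")
    (root / "b.js").write_text("// nothing to refactor here\n")
    sites = discover(root)
    results = operate(sites, lambda p, n: f"edited {p.name}:{n}")

print(results)
```

Coverage comes from the loop over sites, not from the model: every discovered entry is visited exactly once regardless of what the per-site step does.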
3.2 Scratchpads and Task Lists
More sophisticated agents externalize the iteration state into a scratchpad — a checked task list, a file of pending TODOs, an explicit "remaining" set. This is the move from implicit to explicit orchestration. It helps because it converts a property the model is bad at (remembering across many turns where it is in a long enumeration) into a property a tool is good at (maintaining a list). The agent's contribution per turn shrinks to: read the next item, perform a small unit of work, mark it done.
The scratchpad is doing the same job as the discovery step in §3.1, but for the plan rather than the data: it externalizes extension to a substrate the model can read from instead of generate.
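A scratchpad of this kind can be as simple as a JSON file with a pending list and a done list. The class below is a hypothetical sketch (the file layout and method names are assumptions, not an existing tool); the loop body marks where a single small model turn would go.

```python
import json
import tempfile
from pathlib import Path

class Scratchpad:
    """Externalized iteration state: the tool remembers, the model does not."""

    def __init__(self, path: Path, items: list[str]):
        self.path = path
        if not path.exists():
            self._save({"pending": list(items), "done": []})

    def _load(self) -> dict:
        return json.loads(self.path.read_text())

    def _save(self, state: dict) -> None:
        self.path.write_text(json.dumps(state, indent=2))

    def next_item(self):
        pending = self._load()["pending"]
        return pending[0] if pending else None

    def mark_done(self, item: str) -> None:
        state = self._load()
        state["pending"].remove(item)
        state["done"].append(item)
        self._save(state)

with tempfile.TemporaryDirectory() as tmp:
    pad = Scratchpad(Path(tmp) / "todo.json", ["a.py", "b.py", "c.py"])
    processed = []
    while (item := pad.next_item()) is not None:
        processed.append(item)  # stand-in for one small unit of model work
        pad.mark_done(item)
    final_state = pad._load()

print(final_state)
```

Because the state lives on disk, a turn that loses its context can resume from the file rather than from the model's recollection of where it was.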
3.3 The Orchestration Tax
The serial loop is correct but slow. Every turn pays:
- A round-trip latency cost (network + queueing + scheduling).
- A planning-token cost (each turn the model re-reads the task, re-orients, decides what to do).
- A context-management cost (older turns are summarized or evicted, sometimes with information loss).
- A drift cost — the longer the loop, the higher the chance the agent forgets a constraint, double-edits a file, or skips an item.
For a hundred call sites this is annoying; for ten thousand, it is unworkable. The orchestration tax scales linearly with the size of the extension, which is exactly the wrong scaling.
4. The Coding Shortcut
4.1 The Rename Example
Watch a competent agent handle "rename function fooBar to foo_bar across the repository" and a different pattern often appears: the agent does not iterate. It writes a small script — a sed/perl/AST-rewrite invocation — that finds every appearance and applies the change in one execution. Then it runs the script, checks the diff, and reports.
This is both lazy and smart. The agent has noticed that the operation is expressible as a pure function from source-state to source-state, and has chosen to encode the entire output as a program rather than emit it as decoded tokens.
4.2 Why It Works
The script is a closed-form representative for the entire extension. The model only has to decode the generator of the answer, not the answer itself; the answer is whatever edits the script produces when it runs against the actual source. This sidesteps the pressures from §2 and the orchestration tax from §3.3 at once:
- No output-length cost. The decoded artifact is a few dozen lines; the realized effect is unbounded.
- No attention dilution. The model never had to attend to the corpus in the first place.
- No orchestration tax. The execution is deterministic and runs in a single subprocess, not a turn loop.
Coding is not a stylistic choice the agent makes. It is the only known mechanism for delivering extensional answers without paying extensional decode costs.
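The kind of script an agent might emit for the rename can be sketched as follows. It is a deliberately naive textual version using a word-boundary regex; the function name, the *.py glob, and the sample file are assumptions for illustration, and a real agent might well choose sed or an AST rewrite instead.

```python
import re
import tempfile
from pathlib import Path

def rename_identifier(root: Path, old: str, new: str, glob: str = "*.py") -> dict[str, int]:
    """Rewrite whole-word occurrences of `old` under root; return counts per file."""
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    changed = {}
    for path in sorted(root.rglob(glob)):
        new_text, count = pattern.subn(new, path.read_text())
        if count:
            path.write_text(new_text)
            changed[path.name] = count
    return changed

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "m.py").write_text("def fooBar():\n    pass\n\nfooBar()\nfooBarBaz()\n")
    changed = rename_identifier(root, "fooBar", "foo_bar")
    rewritten = (root / "m.py").read_text()

print(changed)
print(rewritten)
```

The decoded artifact is a dozen lines whatever the size of the repository. Note that a textual rename also rewrites matches inside strings and comments, which is one reason the diff check mentioned in §4.1 matters.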
4.3 Where It Breaks
The shortcut works when the operation is uniform — a single transformation applied identically to every member of the set. It breaks the moment the operation is heterogeneous: refactor each call site, but each one needs a different argument depending on what the surrounding function does. A regex cannot make that decision. A naive script either oversimplifies (treating heterogeneous cases as homogeneous, producing broken code) or balloons into a special-case nest that no longer fits in the model's working window.
This is the pinch point. Pure code scales to large extensions but only for shallow per-item decisions. Pure model calls scale to deep per-item decisions but only for small extensions. Most real agent tasks live in the middle: large extension, non-trivial per-item judgment. Neither side of the existing toolkit reaches that quadrant cleanly.
5. A Generalized Operator Language
The escape from §4.3 is to keep the programmatic skeleton — a script-shaped orchestration the model emits once — but allow individual operations inside that skeleton to be model calls rather than primitive transforms. The skeleton stays cheap and verifiable; the per-item judgment goes where it belongs, in a contained subagent invocation. The remainder of this section sketches the vocabulary that makes such skeletons composable.
5.1 Selectors, Cursors, Collections
Before lifting operators onto typed classes, we need a vocabulary for the sets those operators act on. The proposal uses three primitives, in the spirit of set-builder notation and database cursors:
- Selector: a declarative description of a set: "all tasks such that the status is open and the priority is at least 3." A Selector is a typed predicate against a typed world. It has no cardinality, no materialized members, and no I/O cost. It is just the question.
- Cursor: a handle on the materialization of a Selector. Discovering a Selector against the world produces a Cursor: a lazy, possibly bounded, possibly streaming iterator over members. A Cursor knows how to fetch in batches, page through external systems, and (optionally) stay open as the underlying set changes.
- Collection: a resolved set of typed members. Collections are where the algebra lives: map, filter, reduce, flat_map, partition, group_by, union, intersect, diff, zip. Each operator has fixed cardinality semantics (map preserves cardinality; filter is non-increasing; reduce collapses to a value).
The shape of the pipeline:
Selector → .discover() → Cursor → .collect() → Collection
                                                   │
                                                   ▼
                                 map / filter / reduce / partition / ...
The agent never enumerates members in tokens. It emits a Selector (the question) and an operator pipeline (what to do with the answer). The runtime is responsible for materializing, parallelizing, caching, and streaming.
# 1. Pose a question. No I/O, no materialization, no cost.
open_tasks: Selector[Task] = (
    World.tasks
    .where(lambda t: t.status == "open")
    .where(lambda t: t.priority >= 3)
)
# 2. Discover. The Cursor is now a handle on the (possibly large,
#    possibly streaming) result set. Still no full materialization.
cursor: Cursor[Task] = open_tasks.discover()
# 3. Collect when the algebra needs the whole set; stream through
#    the Cursor when it doesn't.
discovered: Collection[Task] = cursor.collect()
The separation matters for three reasons:
- A Selector is auditable as a question. A reviewer can read it and decide whether the question is the right one before any inference fires.
- A Cursor lets the runtime decide how to materialize. It can page, parallelize, stream, or skip materialization entirely if the downstream operator is associative.
- A Collection is the only place the algebra runs. A map over a Collection always covers every member, full stop. There is no "the model summarized the tail."
In set-builder notation, a Selector is { x ∈ World : P(x) }; a Cursor is the iterator that lazily walks that set; a Collection is the realized set on which the algebra is closed. The agent writes the comprehension; the runtime walks it.
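To show that the three primitives fit together, here is a minimal in-memory sketch. It is not the Structured Agency runtime: the class bodies, the callable-source convention, and the Task fields are all assumptions made for illustration, and only map and filter are implemented.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")
U = TypeVar("U")

class Collection(Generic[T]):
    """A resolved set of members; the only place the algebra runs."""
    def __init__(self, members: Iterable[T]):
        self._members = tuple(members)
    def __iter__(self):
        return iter(self._members)
    def __len__(self):
        return len(self._members)
    def map(self, f: Callable[[T], U]) -> "Collection[U]":
        return Collection(f(m) for m in self._members)  # preserves cardinality
    def filter(self, pred: Callable[[T], bool]) -> "Collection[T]":
        return Collection(m for m in self._members if pred(m))  # non-increasing

class Cursor(Generic[T]):
    """A lazy walk over the members that satisfy a Selector."""
    def __init__(self, source: Iterable[T], predicates):
        self._iter = (m for m in source if all(p(m) for p in predicates))
    def __iter__(self):
        return self._iter
    def collect(self) -> Collection[T]:
        return Collection(self._iter)

class Selector(Generic[T]):
    """Just the question: a source plus predicates, with no I/O until discover()."""
    def __init__(self, source: Callable[[], Iterable[T]], predicates=()):
        self._source = source
        self._predicates = tuple(predicates)
    def where(self, pred: Callable[[T], bool]) -> "Selector[T]":
        return Selector(self._source, self._predicates + (pred,))
    def discover(self) -> Cursor[T]:
        return Cursor(self._source(), self._predicates)

@dataclass(frozen=True)
class Task:  # hypothetical typed member
    name: str
    status: str
    priority: int

loads = []
def load_tasks():
    loads.append("load")  # records when the world is actually touched
    return [Task("a", "open", 5), Task("b", "closed", 4),
            Task("c", "open", 1), Task("d", "open", 3)]

open_tasks = (Selector(load_tasks)
              .where(lambda t: t.status == "open")
              .where(lambda t: t.priority >= 3))
loads_before_discover = len(loads)
names = open_tasks.discover().collect().map(lambda t: t.name)
print(list(names), loads_before_discover, len(loads))
```

Building the Selector touches nothing; the source is read only at discover(), and the map at the end runs over every collected member.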
5.2 Programmatic and Agentic Operators
The pseudo-Python in §5.1 makes it look as though every predicate and every per-element function is a programmatic callable — a Python lambda the runtime invokes deterministically. That is the default, but it is not the only mode, and the alternative is the move that distinguishes this operator language from a traditional query algebra. Each predicate, transformer, or aggregator can be either programmatic or agentic, and the algebra is closed under either choice.
- Programmatic mode. A callable: lambda t: t.priority >= 3, def deterministic_rewrite(site): .... Cheap, fast, fully auditable, runs in process. The right default whenever the per-element decision can be expressed as code.
- Agentic mode. A textual instruction: "the user seems frustrated in the comments", "rewrite each call site to use process_record instead of process, preserving its original behavior". The runtime evaluates the instruction per-element via a subagent call, returning the same typed shape the programmatic counterpart would have returned. The right choice when the per-element decision cannot be cleanly expressed as code.
# A Selector with mixed predicates. The cheap programmatic filter narrows
# the candidate set first; the agentic predicate handles what code can't.
frustrated_open_tasks: Selector[Task] = (
    World.tasks
    .where(lambda t: t.status == "open")                 # programmatic
    .where("the user seems frustrated in the comments")  # agentic
)
# A map with an agentic transformer. Cardinality and coverage guarantees
# of `map` are unchanged; only the per-element semantics differ.
edits: Collection[Edit] = discovered.map(
    "rewrite each call site to use process_record, "
    "preserving its original behavior"
)
Two properties of this polymorphism matter:
- An agentic predicate is sugar for an agentic map followed by filter. Selector.where(text) desugars to a per-element classifier call producing booleans, followed by a filter on the result. Promoting it to a first-class predicate keeps the algebra closed and the pipeline auditable; the alternative is a sprawl of bespoke per-element classifier scripts. This is precisely the case where there is no easy programmatic way to express the selector: describe the condition, and a per-item interpretation decides membership.
- Programmatic narrows; agentic interprets. The discipline for cost is to push cheap programmatic filters as early in the pipeline as possible, because they shrink the candidate set the agentic step has to interpret. An agentic predicate over the full corpus is expensive; an agentic predicate after a programmatic prefilter usually is not.
This polymorphism is the contribution of the operator language relative to LINQ-style query algebras (which assume predicates are programmatic) and relative to free-form prompt-based agents (which have no algebra at all). The operator language keeps the algebra and lets predicates, transformers, and aggregators be either programmatic or agentic — chosen per operator, not per pipeline.
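The dispatch between the two modes, and the cost effect of ordering them, can be sketched with a stub. Everything here is hypothetical: stub_classifier is a keyword match standing in for a subagent call, the task dictionaries are invented, and where is a free function rather than a Selector method.

```python
from typing import Callable, Iterable, Union

Predicate = Union[Callable[[dict], bool], str]
agentic_calls = 0  # counts how many per-element "subagent" calls were made

def stub_classifier(instruction: str, item: dict) -> bool:
    """Stand-in for a subagent call; a real one would send instruction and item to a model."""
    global agentic_calls
    agentic_calls += 1
    return "!!" in item["comments"]

def where(items: Iterable[dict], pred: Predicate) -> list[dict]:
    if callable(pred):  # programmatic mode: run in process
        return [it for it in items if pred(it)]
    verdicts = [(it, stub_classifier(pred, it)) for it in items]  # agentic map
    return [it for it, keep in verdicts if keep]                  # then filter

tasks = [
    {"id": 1, "status": "open", "comments": "this is broken again!!"},
    {"id": 2, "status": "closed", "comments": "still waiting!!"},
    {"id": 3, "status": "open", "comments": "thanks, works now"},
    {"id": 4, "status": "closed", "comments": "resolved"},
]

open_only = where(tasks, lambda t: t["status"] == "open")  # cheap prefilter
frustrated = where(open_only, "the user seems frustrated in the comments")
print([t["id"] for t in frustrated], agentic_calls)
```

With the programmatic filter first, the agentic predicate runs on two items instead of four; on a real corpus that ordering is where most of the savings come from.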
5.3 Operators Over Structured-Agency Classes
Structured Agency defines typed classes for the objects an agent manipulates — Queries, Evidence, Plans, Edits, Claims, Tasks. With the Selector / Cursor / Collection plumbing from §5.1, the standard collection algebra lifts cleanly onto these classes. Each operator's structural meaning is fixed; the per-element function can be a model call, a tool call, or a deterministic transform.
The heterogeneous refactor from §4.3 — "rename process to process_record, but only where it is the data-pipeline function and not the unrelated request handler" — becomes:
# Pose the question declaratively.
sites: Selector[CallSite] = (
    World.code
    .of_type(CallSite)
    .where(lambda s: s.callee.name == "process")
)
discovered: Collection[CallSite] = sites.discover().collect()
# Per-site judgment is delegated to a subagent. The cardinality of
# `edits` matches `discovered` exactly: the orchestration guarantees
# coverage of the extension.
edits: Collection[Edit] = discovered.map(
    lambda s: agent.decide_refactor(
        s,
        context=s.surrounding_function,
        rename_to="process_record",
    )
)
# Set algebra over the results.
risky, safe = edits.partition(lambda e: e.confidence < 0.8)
to_review: Collection[Edit] = risky.union(
    edits.filter(lambda e: e.touches_public_api)
)
by_module: dict[str, Collection[Edit]] = safe.group_by(lambda e: e.module)
total_loc: int = edits.reduce(lambda acc, e: acc + e.lines_changed, initial=0)
# Side effects materialize when a Collection is dispatched to a write tool (see §5.4).
safe.apply(write_to_disk_tool)
to_review.apply(human_review_tool)
Three properties of this style:
- The orchestration is programmatic. A static read of the pipeline is sufficient to confirm that every member of the Collection was visited.
- The judgment is contained. Each map invocation is a subagent call with only the local context it needs. The model never decodes the full list; the script never needs to encode the per-site decision.
- The middle quadrant is reachable. Tasks that are extensional in coverage but judgmental per element (the gap pure scripts and pure model calls each fail to span) collapse to a one-line map.
The .apply(tool) calls in the example are not a side-effect primitive; they are typed dispatch over the existing Structured Agency tool-calling machinery. The next subsection makes that precise.
5.4 Apply: Tool Dispatch Over Typed Collections
apply is the operator the rest of §5 has been pointing toward, and it is not new infrastructure. Structured Agency §3.3 already specifies the harness behavior:
"Tool calls generate K tasks. When the agent calls a tool, the harness creates a task for that call. K is one in the simplest case (a synchronous tool that returns a value). It is more than one when the tool's semantics are inherently parallel: a multi-query tool that fans out to several datasources creates one parent task and K child tasks; a batched action tool creates a parent and one task per item."
The Collection determines K declaratively. The Tool determines per-element semantics. The harness already tracks the parent and the K children. apply is the operator-algebra name for this lift: applying a Collection of typed Structure Class objects to a Tool. Tools are themselves state in the Structured Agency framing, so a typed Tool[T → U] is coherent with the rest of the type system rather than a foreign abstraction bolted on.
Type signature
Collection[T].apply(tool: Tool[T → U]) → Collection[Result[U, T]]
Result[U, T] = Ok[U] | Failure[T, Error]
A Failure[T, Error] preserves the input element, not just the error message. That makes the failed sub-Collection itself eligible for further operators — retry, route to a different Tool, escalate to human review — using the same algebra as the success path.
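A small sketch shows why keeping the input on the failure side matters. The Ok and Failure dataclasses and the free-standing apply function below are illustrative assumptions, not the Structured Agency types; plain Python callables stand in for Tools.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar, Union

T = TypeVar("T")
U = TypeVar("U")

@dataclass(frozen=True)
class Ok(Generic[U]):
    value: U

@dataclass(frozen=True)
class Failure(Generic[T]):
    input: T          # the original element, kept so it can be re-routed
    error: Exception

def apply(items: list[T], handler: Callable[[T], U]) -> list[Union[Ok[U], Failure[T]]]:
    """Run handler on every item; one Result per input, never fewer."""
    results = []
    for item in items:
        try:
            results.append(Ok(handler(item)))
        except Exception as exc:
            results.append(Failure(item, exc))
    return results

results = apply(["1", "x", "3"], int)  # int("x") raises ValueError
successes = [r.value for r in results if isinstance(r, Ok)]
failed_inputs = [r.input for r in results if isinstance(r, Failure)]

# Retry or escalation is just another apply over the failed inputs.
retried = apply(failed_inputs, lambda s: 0)
print(successes, failed_inputs, retried)
```

Because each Failure carries its element, the failed subset is a first-class input to the next operator instead of a log line someone has to parse.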
map vs. apply
- map(f): pointwise transformation by an arbitrary callable. No structural guarantees beyond cardinality.
- apply(tool): pointwise transformation by a Tool. Inherits the Tool's validators, multi-level descriptions, introspective errors, and harness observability. Inherits the harness's parent/child task tracking. The per-element semantics are auditable from the Tool definition alone.
apply is the typed, validated, observable specialization of map. When the per-element function is a Tool with a defined input type, prefer apply; when it is an arbitrary lambda or in-process function, map is the right primitive.
Worked example
# A Tool with a typed input domain. See Structured Agency §3.4 (#tool-design).
refactor_site: Tool[CallSite, Edit] = Tool(
    name="refactor_site",
    input_type=CallSite,
    output_type=Edit,
    validators=[CallSite.has_resolved_callee],
    description="Rewrite a single call site to use the renamed callee...",
    handler=...,
)
# Selector → Cursor → Collection (§5.1).
discovered: Collection[CallSite] = (
    World.code
    .of_type(CallSite)
    .where(lambda s: s.callee.name == "process")
    .discover()
    .collect()
)
# apply: one parent task, K child tasks, one per CallSite.
results: Collection[Result[Edit, CallSite]] = discovered.apply(refactor_site)
# Successes and failures are themselves Collections.
# Failure takes two type parameters, matching Failure[T, Error] above.
edits: Collection[Edit] = results.successes()
failures: Collection[Failure[CallSite, Error]] = results.failures()
# Retry/escalation is just another apply.
escalated: Collection[Result[Edit, CallSite]] = (
    failures.map(lambda f: f.input)
    .apply(human_review_tool)
)
# Side effects come from write-tools, not from apply itself.
edits.apply(write_to_disk_tool)
Why this matters
- Tool selection is structural. Which tool handles which Collection is visible in the pipeline, not buried inside a map lambda. A reviewer reads discovered.apply(refactor_site) and knows what semantics are about to fire.
- Per-element behavior is inherited, not re-specified. Validators, error introspection, and observability come from the existing Tool definition. The operator language adds no new contract for any of them.
- The K-tasks fan-out already exists. The harness has been doing parent/child task tracking for batched tool calls since Structured Agency. apply lifts that behavior to the operator algebra so it is named, composable, and reusable across Collections.
Side effects vs. transformations
Under this framing, apply is neutral. Side effects come from write-tools — tools whose handler does I/O, mutates external state, or invokes a human. A Collection[Edit] becomes a side effect only when applied to a tool whose input type is Edit and whose semantics include writes. The algebra over Collections stays pure; the dispatch is typed and observable. This is the resolution to the §5.3 question of what .apply() "really" does — it is exactly tool dispatch, no more, with the orchestration coming for free from the Structured Agency harness.
5.5 Parallel Cognitive Delegation
Once orchestration is programmatic, the per-element calls in a map or apply over a Collection are independent and can be issued in parallel; the runtime can pick a parallel executor without the agent having to plan it. Under apply, the parallelism is exactly the K-tasks fan-out from §5.4: one parent task, K independent children, aggregated back into a typed Collection[Result]. The orchestration tax from §3.3 collapses: given enough parallel capacity, latency is bounded by the slowest single call, not by the cardinality of the set. KV-cache economics work in our favor here too, and they work better for apply than for map, because a Tool's prompt prefix is stable by construction (it is part of the Tool definition), so cached prefixes hit reliably across the K children [7].
This is the structural answer to "how do you handle ten thousand call sites?" — you don't. You issue ten thousand cheap, independent, cache-warm calls and aggregate the results. The agent's cognitive load stays constant.
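The fan-out itself needs very little machinery. The sketch below uses asyncio.gather with a semaphore as a concurrency cap; child is a hypothetical stand-in that sleeps instead of calling a model or tool, and the limit of 100 is an arbitrary illustrative value.

```python
import asyncio

async def child(item: int, delay: float = 0.01) -> int:
    """Stand-in for one independent subagent or tool call."""
    await asyncio.sleep(delay)
    return item * 2

async def fan_out(items, limit: int = 100) -> list[int]:
    """One parent, K children; at most `limit` children in flight at once."""
    gate = asyncio.Semaphore(limit)

    async def guarded(item):
        async with gate:
            return await child(item)

    # gather returns results in input order, so position maps back to the member.
    return await asyncio.gather(*(guarded(i) for i in items))

results = asyncio.run(fan_out(range(1000)))
print(len(results), results[:5])
```

Wall-clock time here scales with the number of batches the cap allows rather than with the number of items, which is the sense in which the tax collapses; a real runtime would also need rate limits and per-child error handling.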
5.6 Connection to Recursive Language Models
The proposal is adjacent to the Recursive Language Models framing of Zhang, Kraska, and Khattab [8], which treats long prompts as part of an external environment that the model programmatically examines, decomposes, and recursively calls itself over. RLM's empirical claim is that this recursion can process inputs two orders of magnitude beyond the model's context window while matching or beating long-context scaffolds at comparable cost [8].
The operator language proposed here is the typed, structured-class version of the same recursion: instead of free-form recursive calls over arbitrary text snippets, the recursion is constrained to operators that compose over Structured Agency classes. The constraint is the point — it is what makes the orchestration auditable, what guarantees coverage of the extension, and what lets the executor parallelize without the agent having to plan the parallelism. apply is the typed, observable specialization of RLM-style recursion when the recursion is bounded: a single fan-out level, dispatched to a Tool whose semantics are known statically.
6. Open Questions
- Aggregation semantics. A reduce over agent outputs needs commutativity and associativity it does not get for free; what does a principled merge look like for typed classes like Edit or Claim? Per-element apply answers per-element semantics, but it does not answer combine.
- Cost-aware planning. When should the agent map, apply, fall back to a deterministic script, or revert to the §3 multi-turn loop? A static heuristic feels wrong; a learned planner feels premature.
- Subagent contamination. Parallel apply children share cache prefixes by construction. Does that introduce subtle correlations across nominally independent decisions, and how would we detect it?
- Tool-domain coverage. apply assumes a Tool exists with the right typed input. What is the discipline for deciding when an operation deserves a first-class Tool versus when it should stay a map lambda? The choice has structural consequences (auditability, observability, cache behavior) that are easy to underrate.
- Failure routing as policy. A Collection[Failure] can be retried, escalated, or dropped. Which choice is right depends on the Tool, the cost of the input, and the consequences of being wrong. The operator algebra exposes the choice; it does not make it.
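The aggregation question can be illustrated with ordinary arithmetic. The sketch below compares a sequential left fold with a tree-shaped reduce, the shape a parallel runtime would naturally use; tree_reduce is a hypothetical helper, and integers stand in for typed classes like Edit or Claim.

```python
from functools import reduce
import operator

def tree_reduce(op, items):
    """Combine items pairwise in a balanced tree, as a parallel executor might."""
    items = list(items)
    if len(items) == 1:
        return items[0]
    mid = len(items) // 2
    return op(tree_reduce(op, items[:mid]), tree_reduce(op, items[mid:]))

values = [1, 2, 3, 4]

# Addition is associative: evaluation order does not matter.
add_sequential = reduce(operator.add, values)
add_tree = tree_reduce(operator.add, values)

# Subtraction is not: the two evaluation orders disagree.
sub_sequential = reduce(operator.sub, values)  # ((1 - 2) - 3) - 4
sub_tree = tree_reduce(operator.sub, values)   # (1 - 2) - (3 - 4)

print(add_sequential, add_tree, sub_sequential, sub_tree)
```

A merge written as an agentic instruction has no such guarantee by default, so a runtime that reorders or parallelizes a reduce can change the answer without any operator in the pipeline changing.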
References
[1] P. Singhal et al., "A Long Way to Go: Investigating Length Correlations in RLHF." openreview.net/forum?id=G8LaO1P0xv. See also "Explaining Length Bias in LLM-Based Preference Evaluations," arXiv:2407.01085. arxiv.org/abs/2407.01085
[2] "Test-Time Training for Long-Context LLMs," arXiv:2512.13898 (December 2025) — identifies score dilution as a fundamental long-context limitation. arxiv.org/abs/2512.13898
[3] "Context Length Alone Hurts LLM Performance Despite Perfect Retrieval," arXiv:2510.05381. arxiv.org/abs/2510.05381
[4] N. F. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," TACL 2024 / arXiv:2307.03172. arxiv.org/abs/2307.03172
[5] "Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization," ACL Findings 2024. openreview.net/forum?id=fPmScVB1Td
[6] "LLM Inference: Prefill, Decode, KV Cache & Cost Guide," Morph (2026). morphllm.com/llm-inference. See also "Mastering LLM Techniques: Inference Optimization," NVIDIA Technical Blog. developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization
[7] "Token Economics Vocabulary: The LLM Cost Glossary." digitalapplied.com/blog/token-economics-vocabulary-guide-llm-cost-2026. See also "KV Cache Explained," Morph. morphllm.com/kv-cache-explained
[8] A. L. Zhang, T. Kraska, O. Khattab, "Recursive Language Models," arXiv:2512.24601. arxiv.org/abs/2512.24601
Document Status: Wireframe
Last Updated: 2026-05-04
Author: Mario Garrido