# Reasoning Leak — Diagnosis & Fix

> **The leak.** The agent narrates its internal decision logic to the lead instead of just acting on it:
> _"Antes de continuar, necesito verificar los ingresos del lead. Está generando ~12K MXN/mes (~$630 USD),
> por debajo del umbral de $1,500 USD. Aplicar guard ICP."_ The lead was supposed to receive a warm message,
> not a peek at the machinery.
>
> **This is fixable.** Two independent levers move the needle: a **prompt fix** (validated, 0 leaks across a
> 19-scenario reproduction set) and a **platform CoT filter** (shipped, dropped the fleet leak rate
> 0.62% → 0.21%). Use both — prompt for prevention, filter for defense-in-depth. This supersedes the old
> guidance (PAT-046) that called leaks "not prompt-fixable"; see the reconciliation note at the end.

---

## 1. When it fires

The leak is a **branch-decision event**, not a vocabulary problem. Always-on rules ("habla en castellano
neutro", "mensajes cortos") are applied silently across thousands of conversations — they never leak. The model
leaks when it has to **evaluate state and pick a branch**, because justifying the choice is a narratable
cognitive act.

It fires most reliably when ALL of these hold (≈2 in 3 cases in reproduction):

1. The prompt has a **conditional rule with a literal threshold** (`if income < $1,500 → close empático`).
2. The lead **declares a specific number** clearly under the line (borderline numbers don't trigger it).
3. The rule is phrased as an **imperative verification step** ("Verificar X antes de Y", "Si bajo umbral → guard Z").

Remove any one — qualitative trigger, borderline value, no condition — and it usually doesn't fire. The same
mechanism drives non-numeric branches: keyword routing ("Si keyword=MODELO y profesión desconocida → preguntá"),
history checks ("ya envié esto"), and audience classification ("reactivación broadcast").

## 2. Why it happens

The model treats an imperative guard as a **chain-of-thought step to narrate**, the way it would narrate a tool
call. Four properties of the prompt language drive it (and each shows up verbatim in the output):

| Prompt property | Example in prompt | Echoed to the lead |
|---|---|---|
| Imperative verb | "Verificar ICP antes del cierre" | "necesito verificar…" |
| Literal threshold | "< $1,500 USD/mes" | "$750 está por debajo del umbral de $1,500" |
| 3rd-person framing | "el lead declaró ingresos…" | "el lead tiene $750 USD/mes" |
| Labeled action | "→ guard ICP", "aplicar guard" | "Aplicar guard ICP" |

**Important nuance (verified via Langfuse traces):** the leaked vocabulary often is NOT literal in the prompt.
At an ambiguous decision moment the model *generates fresh prose* to describe its own state ("Contexto ambiguo — el
lead responde sin mensaje previo visible"). So cleaning prompt vocabulary reduces ugliness but does not by itself
stop the leak — the decision moment is the real surface.

## 3. The prompt fix (validated)

A **single generic rule** in Principle 1, identical for every agent. It prohibits specific OUTPUT behaviors
without forbidding the internal reasoning the model needs to apply the guard. Spanish agents:

```markdown
1. **Disfraz humano + sin razonamiento visible.** El lead nunca sabe que es bot ni que hay handoff.
   [manejo agent-specific de "¿sos un bot?" acá]
   **Una sola respuesta por turno, siempre en 1ra persona dirigida al lead (tú/te).** Toda evaluación
   interna (guards, comparaciones con el prompt, decisiones de historial como "ya envié esto" o "ya
   respondí esto") ocurre invisible: hacés la acción y listo. Cero pre-anuncios ("voy a verificar",
   "déjame revisar", "antes de continuar"). Cero narración en 3ra persona sobre el lead ("los ingresos
   del lead"). Cero eco de números, etiquetas o estados del prompt en el output ("$X umbral", "perfil
   hot/warm", "guard ICP", "STEP 0", "keyword Y"). Cero menciones a entregas previas ("ya te lo mandé
   arriba"): si ya enviaste algo, simplemente seguí adelante sin nombrarlo.
```

English agents:

```markdown
1. **Human disguise + no visible reasoning.** The lead never knows it's a bot or that a handoff exists.
   One reply per turn, always first-person to the lead (you). All internal evaluation (guards, comparisons
   against the prompt, history decisions like "already sent this") happens invisibly: take the action and
   move on. No pre-announcements ("let me check", "before I continue"). No third-person narration about the
   lead ("the lead's income"). No echoing numbers, labels, or prompt states ("$X threshold", "hot/warm
   profile", "ICP guard", "STEP 0"). No mentions of prior deliveries: if you already sent something, just
   move forward without naming it.
```

It names **categories of behavior** (pre-announce, 3rd-person narration, prompt-token echo, "already-sent"
mentions), not per-agent tokens — so the same paragraph generalizes across leak shapes and agents. It is baked
into `agents/_TEMPLATE/sdk/system.md` and the `agents/sample-*` references.

### Rules for applying it

- **Keep the threshold in the prompt.** The model needs it to apply the guard. Just frame it as a condition it
  evaluates silently, not an imperative verification step.
- **Don't aggressively forbid all "thinking aloud."** A heavy "prohibido scratchpad" rule kills the leak by
  killing the rule — in reproduction the agent stopped applying the guard entirely and sent the booking link to
  unqualified leads. Prohibit OUTPUT behaviors, not reasoning.
- **Don't anchor to specific examples or token blacklists.** "Ver Ejemplo 18", `["$1,500","30K MXN","ICP"]` —
  brittle, agent-specific, and violates the examples-must-be-pure-dialogue rule. They go stale and don't generalize.
- **Audit the source.** Grep `system.md` Flujo/Calibración for imperative verbs (`verificar`, `comprobar`,
  `evaluar`, `revisar antes de`), literal thresholds inside conditionals, 3rd-person "el lead", and labeled
  actions ("→ guard X"). Reword those as silent conditions.
- **Push the branch upstream when you can.** The load-bearing fix for a stubborn leaker is to move the decision
  out of the LLM (router/classifier/platform mode flag, or a tool that returns the branch as data). If the model
  doesn't decide, it can't narrate the decision.

### Why the alternatives failed (reproduction record)

| Variant | What it did | Result |
|---|---|---|
| Prohibition-only blacklist | Per-agent vocab blacklist, kept imperative guard prose | Leak ~6%, behavior regressed |
| Move thresholds to a tool | Removed literal `$1,500`, fetched via tool | Leak ~6% but worse — narrated profile labels instead |
| Heavy "no thinking aloud" | Aggressive scratchpad ban | Leak 0% but behavior **collapsed** (guard stopped firing) |
| Token list + example ref | Blacklist + "see Example X" | Leak 0% but hardcoded to one agent, brittle |
| **Generic rule (this doc)** | Categories of OUTPUT behavior | **0/19**, all branches still applied correctly |

## 4. The platform CoT filter (`responseFilter`)

A runtime post-filter (`mk1-chat/app/utils/messaging/cot-filter.ts`) runs **after** the LLM generates and
**before** the message is sent — see `knowledge/platform-internals.md` for the full pipeline. It is **on by
default for every agent** with ~52 global regex patterns. Per outbound message it returns:

- `pass` — no pattern matched → send unchanged.
- `stripped` — a CoT paragraph matched but a later `\n\n`-delimited paragraph is clean → send only the clean part.
- `blocked` — the whole message is CoT → suppressed as `NO_RESPONSE - cot_filter_blocked`.

**Structural limit:** it only splits on `\n\n` boundaries. A leak **inline** in the same paragraph as the real
reply can't be stripped (it's blocked wholesale or passes). This is exactly why the prompt fix matters — the
filter is a net, not a cure.

### Per-agent config — `responseFilter` (in `agent_configs.config`, alongside `v5Config`)

```jsonc
{
  "enabled": true,              // false disables the filter for this agent
  "extraPatterns": ["\\bmi flujo de calificaci[oó]n\\b"],  // additive, agent-scoped, zero blast radius
  "disabledDefaults": ["\\bel lead\\b"]  // drop a default (e.g. an operator-facing agent that legitimately says it)
}
```

`extraPatterns` is the safe lever: additive, scoped to one agent. Use it when you confirm a real leak shape the
defaults miss. **⚠ Do NOT use `responseFormatting` for leaks** — setting `responseFormatting.rules` **overrides**
all `DEFAULT_FORMATTING_RULES` (not merged), silently breaking voice features like inverted-punctuation removal,
and it only replaces the matched span (leaving a dangling leak). Always reach for `responseFilter` instead.

> **Deploying it — `update_agent_config`.** `responseFilter` lives in `agent_configs.config` (alongside
> `config.v5Config`, not inside it), so it's set with `update_agent_config`, not `deploy_agent_sdk`. Pass a
> `response_filter` object — PATCH-merged, so send only what you change:
>
> ```jsonc
> update_agent_config({
>   agent_id,
>   response_filter: {
>     enabled: true,                          // optional; on by default
>     extra_patterns: ["\\bmi flujo de calificaci[oó]n\\b"],  // additive JS-regex patterns to strip
>     disabled_defaults: ["\\bel lead\\b"]    // drop a default for this agent (e.g. operator-facing)
>   },
>   reason: "..."
> })
> ```
>
> The prompt fix (§3) is still the primary lever — the filter is a complementary safety net for residue.

## 5. Shape catalog — diagnostic grep table

The same behavior wears different surface tokens depending on the agent's vocabulary. Use these to detect leaks
in `get_conversations` output or a PG `messages.content` sweep (low false-positive in the setter domain):

| Shape | What it is | Diagnostic tokens |
|---|---|---|
| A | Threshold / guard narration | `por debajo del umbral`, `umbral de calif`, `guard ICP`, `aplicar guard` |
| B | 3rd-person about the lead | `el lead `, ` al lead `, ` la lead `, `el lead tiene/está/dice` |
| C | Meta-commentary about the agent | `el agente `, `el agente mandó/envió` |
| D | Imperative verb / verification preamble | `necesito verificar`, `antes de continuar`, `voy a verificar` |
| E | "Viendo el contexto" + "Debo X" modal | `viendo el contexto`, `debo cerrar/manejar/confirmar`, `no debo` |
| F | "Already-sent / already-done" meta | `ya le envié`, `ya le mandé`, `ya cerré`, `ya respondí` |
| G | Prompt artefact names exposed | `STEP 0`, `el keyword es`, `según el flujo`, `aplico el flujo`, `countryCode`, `firstName`, `la profesión no se conoce` |
| H | Branded artefact echo | a `claude.ai/public/artifacts/…` card sent to the lead (`✱ Claude — BY ANTHROPIC` visible) |
| I | Italicized internal note | `*nota interna…*`, `ya tengo el dato`, `no necesito repreguntar` |
| K | Self-narration of next action | `procedo con (el) cierre`, `procedo al cierre`, `procedo a enviar` |

Shape G is the most diagnostic — tokens like `STEP 0` / `keyword MODELO` literally name prompt structure and have
no benign use. Shape H is a **tool-result** problem (a `get_resource` returning a generated artifact URL), not a
prompt-rule problem — fix the tool's response shape, not the prompt.

## 6. Diagnose → fix loop (operator workflow)

1. **Confirm it's real, not a false positive.** Pull `get_conversations`, grep for the shape tokens above, and
   count **distinct conversation_ids** (one outlier conv can inflate a pair-based count — see PAT-045).
2. **Find the source rule.** Grep the agent's `system.md` for the imperative/threshold/3rd-person/label patterns
   in §2. That's almost always the rule the model is echoing.
3. **Apply the prompt fix** (§3): ensure Principle 1 carries the generic rule, and reword the offending Flujo/
   Calibración step from an imperative verification step into a silent condition. Deploy via `deploy_agent_sdk`,
   verify, roll back on `verified:false`.
4. **Validate.** Re-run the leaking scenarios (use `cortex-test` with scripted messages that reveal a clearly
   sub-threshold number, repeated 3× for stochasticity) and confirm the branch still fires correctly **and** no
   leak. A leak fix that breaks the guard is worse than the leak.
5. **Defense-in-depth.** If the shape is cross-agent, note it for a global `cot-filter.ts` default; if agent-local,
   it belongs in that agent's `responseFilter.extraPatterns` (pending the MCP tool).

## 7. Reconciliation note — supersedes PAT-046

The old PAT-046 said reasoning leakage was "platform + temperatura, no prompt-fixable, NO iterar". Half-true:
**brittle per-agent token blacklists and aggressive scratchpad bans don't help** (and the latter breaks behavior)
— that part stands. But the **generic OUTPUT-behavior rule** (§3, 0/19 in reproduction) and the **shipped CoT
filter** (§4, −66% fleet leak rate) both demonstrably do. The correct guidance is: prevent with the generic rule
in every agent, catch the residue with the filter, and reach for upstream/tool-based branching on stubborn cases.
PAT-046 now points here.
