Analysis

What frontier agents actually leak.

Capability gains have not translated into restraint. Even the best agents disclose private context to the wrong recipient on a meaningful fraction of everyday tasks.

Three failure modes

One representative scenario per mode, rendered as the surface the agent sees.

Utility vs disclosure restraint

Among agents above 75% utility, engaged leakage spans an 84-point range. Marker area scales with refusal rate.
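The quantities in this plot can be sketched as follows. This is a minimal illustration and not the study's scoring code. It assumes each run is recorded with three booleans (the field names `refused`, `task_completed`, and `leaked` are hypothetical) and that engaged leakage means the leakage rate among runs the agent did not refuse; the study's exact definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class Run:
    refused: bool         # hypothetical: agent declined to act
    task_completed: bool  # hypothetical: judge marked the task as done
    leaked: bool          # hypothetical: judge found a must-not-share item


def summarize(runs):
    """Return utility, refusal rate, and engaged leakage as percentages."""
    if not runs:
        raise ValueError("need at least one run")
    total = len(runs)
    # Assumption: "engaged" runs are those the agent did not refuse.
    engaged = [r for r in runs if not r.refused]
    utility = 100.0 * sum(r.task_completed for r in runs) / total
    refusal_rate = 100.0 * (total - len(engaged)) / total
    if engaged:
        engaged_leakage = 100.0 * sum(r.leaked for r in engaged) / len(engaged)
    else:
        engaged_leakage = 0.0
    return {
        "utility": utility,
        "refusal_rate": refusal_rate,
        "engaged_leakage": engaged_leakage,
    }


# Made-up runs, for illustration only.
runs = [
    Run(refused=False, task_completed=True, leaked=True),
    Run(refused=False, task_completed=True, leaked=False),
    Run(refused=False, task_completed=False, leaked=False),
    Run(refused=True, task_completed=False, leaked=False),
]
stats = summarize(runs)
```

Conditioning leakage on engaged runs keeps an agent from looking safe merely because it refuses often, which is presumably why refusal rate is shown separately as marker area.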

End-to-end trajectory viewer

Step through real runs from the deployment study. Walk to the final step to reveal the agent's sent message and the judge's verdict.

Compare model outputs

Same scenario, two agents, side-by-side. Must-share items are highlighted in green; must-not-share items in red.
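The green and red highlighting can be approximated with a simple tagging pass. The sketch below is an assumption-laden stand-in: it uses case-insensitive substring matching where the study uses a judge, and the function name `tag_items`, the item lists, and both outputs are made up for illustration.

```python
def tag_items(output, must_share, must_not_share):
    """Classify scenario items by whether they appear in an agent's output.

    Substring matching is a crude stand-in for a judge model.
    """
    text = output.lower()
    return {
        "shared": [i for i in must_share if i.lower() in text],       # green
        "missing": [i for i in must_share if i.lower() not in text],
        "leaked": [i for i in must_not_share if i.lower() in text],   # red
    }


# Hypothetical scenario items and agent outputs.
must_share = ["Tuesday 3pm"]
must_not_share = ["medical leave"]

output_a = "Let's meet Tuesday 3pm."
output_b = "Let's meet Tuesday 3pm; Sam is back from medical leave."

tags_a = tag_items(output_a, must_share, must_not_share)
tags_b = tag_items(output_b, must_share, must_not_share)
```

Here both outputs complete the task, but only the second would show a red highlight.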

Mitigations

Three prompt-level defenses cut engaged leakage by 33–36 points while raising utility by 16–23 points, without relying on refusals.

Restrictive

Read only the fields directly required by the task. Omit content from neighbouring rows.

Rubric-informed

Add a four-point contextual integrity (CI) rubric to the system prompt: necessity, recipient appropriateness, source isolation, voice neutrality.

Recipient-typed

List the recipient and the contextual norms appropriate for that recipient before emitting the output.
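The three defenses above could be wired into an agent as interchangeable system-prompt additions. The sketch below is illustrative only: the instruction strings paraphrase the descriptions on this page and are not the study's actual prompts, and `BASE_PROMPT`, `DEFENSES`, and `build_system_prompt` are hypothetical names.

```python
# Placeholder base prompt; the study's real system prompt is not shown here.
BASE_PROMPT = "You are an assistant acting on the user's behalf."

# Paraphrased from the descriptions above, not the study's wording.
DEFENSES = {
    "restrictive": (
        "Read only the fields directly required by the task. "
        "Omit content from neighbouring rows."
    ),
    "rubric_informed": (
        "Before sending, check the output against four criteria: "
        "necessity, recipient appropriateness, source isolation, "
        "voice neutrality."
    ),
    "recipient_typed": (
        "Before emitting the output, list the recipient and the "
        "contextual norms appropriate for that recipient."
    ),
}


def build_system_prompt(defense=None, base=BASE_PROMPT):
    """Return the base prompt, optionally extended with one defense."""
    if defense is None:
        return base
    if defense not in DEFENSES:
        raise KeyError(f"unknown defense: {defense!r}")
    return f"{base}\n\n{DEFENSES[defense]}"


prompt = build_system_prompt("recipient_typed")
```

None of the three instructions tells the agent to decline the task, which is consistent with the claim that the gains do not rely on refusals.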

Does disclosure transfer to live UI?

The two lowest-leakage agents were deployed end-to-end in the rendered OpenApps UI on a 50-scenario subset. Leakage on engaged runs matches or exceeds the state-grounded baseline.