Analysis

What frontier agents actually leak.

Failure modes, paired comparisons, end-to-end UI trajectories, and three mitigations that reduce leakage without hurting utility.

Three failure modes

Each scenario instantiates one failure mode grounded in contextual integrity (CI). Below is one representative scenario per mode, rendered as the OpenApps surface the agent actually sees.

Utility vs disclosure restraint

Task-completion utility does not predict disclosure restraint. Among agents with U > 75%, engaged leakage ranges from 14.0% (Claude-Opus-4.7) to 98.3% (Gemini-3.1-Pro), an 84-point spread. Marker area scales with refusal rate; the dashed lines mark U = 75 and L_eng = 50.

Only Claude-Opus-4.7 sits in the "capable & careful" quadrant. GPT-5.4 reaches comparatively low leakage mainly via a 41.9% refusal rate (large marker, lower left).
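The quadrant reading above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's plotting code: the two thresholds come from the dashed lines described in the text, while the `AgentPoint` type, the function names, and every quadrant label other than "capable & careful" are assumptions introduced here.

```python
from dataclasses import dataclass

U_THRESHOLD = 75.0     # dashed line at U = 75 (percent)
LEAK_THRESHOLD = 50.0  # dashed line at engaged leakage = 50 (percent)


@dataclass
class AgentPoint:
    """One marker on the plot (hypothetical container, not from the release)."""
    name: str
    utility: float          # task-completion utility, percent
    engaged_leakage: float  # engaged leakage, percent


def quadrant(point: AgentPoint) -> str:
    """Place an agent in one of the four quadrants cut by the dashed lines."""
    capable = point.utility > U_THRESHOLD
    careful = point.engaged_leakage < LEAK_THRESHOLD
    if capable and careful:
        return "capable & careful"   # label used in the text
    if capable:
        return "capable & leaky"     # hypothetical label
    if careful:
        return "limited & careful"   # hypothetical label
    return "limited & leaky"         # hypothetical label


def leakage_spread(leakages: list[float]) -> float:
    """Spread in percentage points between the highest and lowest leakage."""
    return max(leakages) - min(leakages)
```

With the two leakage figures quoted above, `leakage_spread([14.0, 98.3])` gives the 84-point spread (84.3 before rounding). Utility values for individual agents are not stated in this section, so any `AgentPoint` built for illustration has to use placeholder utilities.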

End-to-end trajectory viewer

Runs from the deployment study, replayed step-by-step from the OpenApps screenshots saved by the harness and paired with each run's final judge verdict.

Scenario state

Walk the slider to the last step to reveal the agent's final output and the judge's verdict.

Compare model outputs

Same scenario, two agents, side-by-side. Outputs are taken verbatim from the released run artifacts: the literal shared_content the agent emitted, with must-share items highlighted in green and must-not-share items highlighted in red.
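The green/red highlighting described above could be approximated as follows. This is only a sketch under a strong assumption: it treats each must-share and must-not-share item as a literal phrase and looks for it with case-insensitive substring matching, whereas the released judge may well use a different matching rule. The function name `annotate` and the returned keys are hypothetical.

```python
def annotate(shared_content: str,
             must_share: list[str],
             must_not_share: list[str]) -> dict[str, list[str]]:
    """Sort scenario items by whether they appear in the agent's output.

    Assumption: an item counts as present if it occurs verbatim
    (ignoring case) in shared_content.
    """
    text = shared_content.casefold()

    def present(item: str) -> bool:
        return item.casefold() in text

    return {
        # would be highlighted green: required items that were shared
        "shared": [i for i in must_share if present(i)],
        # required items the agent left out
        "missing": [i for i in must_share if not present(i)],
        # would be highlighted red: protected items that were disclosed
        "leaked": [i for i in must_not_share if present(i)],
    }
```

Running the same function over two agents' `shared_content` for one scenario yields the side-by-side comparison: identical item lists, different `shared` and `leaked` sets.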

Mitigations

Three prompt-level defenses, swept across Claude-Opus-4.7, GPT-5.4, and DeepSeek-v4-Pro (the three models that span the disclosure distribution). Engaged leakage drops by 33–36 points while utility rises by 16–23 points, so the defenses do not act as a refusal-style intervention.

Restrictive

Read only the fields directly required by the task. Omit content from neighbouring rows.

Rubric-informed

Four-point CI rubric in the system prompt: necessity, recipient appropriateness, source isolation, voice neutrality.

Recipient-typed

List the recipient and the contextual norms appropriate for that recipient before emitting the output.
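One way to wire the three defenses into a harness is to append the chosen instruction to the agent's base system prompt. The sketch below assumes exactly that; the instruction strings paraphrase the one-line descriptions above and are not the prompts used in the study, and `build_system_prompt` and the dictionary keys are hypothetical names.

```python
# Paraphrased from the descriptions above; the study's exact prompt
# wording is not given in this section.
DEFENSES = {
    "restrictive": (
        "Read only the fields directly required by the task. "
        "Omit content from neighbouring rows."
    ),
    "rubric-informed": (
        "Before sharing any item, check it against four criteria: "
        "necessity, recipient appropriateness, source isolation, "
        "and voice neutrality."
    ),
    "recipient-typed": (
        "Before emitting the output, list the recipient and the "
        "contextual norms appropriate for that recipient."
    ),
}


def build_system_prompt(base_prompt: str, defense: str) -> str:
    """Append one defense instruction to the base system prompt."""
    if defense not in DEFENSES:
        raise ValueError(f"unknown defense: {defense!r}")
    return f"{base_prompt.rstrip()}\n\n{DEFENSES[defense]}"
```

Keeping the defenses as plain strings keyed by name makes a sweep over models and defenses a simple nested loop, with the undefended base prompt as the control condition.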

Does disclosure transfer to live UI?

Claude-Opus-4.7 and Claude-Sonnet-4.6, the two lowest-leakage agents in the state-grounded table, were deployed end-to-end in the rendered OpenApps UI on a 50-scenario stratified subset. On engaged runs, leakage stays at or above the state-grounded baseline for both agents.

State-grounded and end-to-end measurements are scored with the same benchmark protocol.
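Because both settings share one scoring protocol, the same two summary numbers can be computed for each. The sketch below assumes a definition that this section does not spell out: that "engaged" runs are those in which the agent did not refuse, and that engaged leakage is the share of engaged runs the judge marked as leaking. The `Run` record and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    """One scored run (hypothetical record shape)."""
    refused: bool  # the agent declined the task
    leaked: bool   # judge verdict: a must-not-share item was disclosed


def refusal_rate(runs: list[Run]) -> float:
    """Percentage of all runs in which the agent refused."""
    if not runs:
        raise ValueError("no runs to score")
    return 100.0 * sum(r.refused for r in runs) / len(runs)


def engaged_leakage(runs: list[Run]) -> Optional[float]:
    """Percentage of non-refused runs that leaked (assumed definition).

    Returns None when every run was a refusal, since the rate is
    undefined in that case.
    """
    engaged = [r for r in runs if not r.refused]
    if not engaged:
        return None
    return 100.0 * sum(r.leaked for r in engaged) / len(engaged)
```

Under this definition a high refusal rate shrinks the engaged denominator rather than lowering engaged leakage directly, which is why the two numbers are reported separately.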