Contextual integrity for computer-use agents

Capable but careless.
Do computer-use agents follow contextual integrity?

A re-runnable evaluation harness that turns disclosure risk in computer-use agents into deterministically scored scenarios for contextual-integrity failures in everyday app workflows.

$ git clone https://github.com/UKPLab/arxiv2026-agentcibench.git

Leaderboard

State-grounded results on the released scenario pool, paired across every agent. Refusal masks raw leakage, so the primary disclosure metric is engagement-conditioned leakage Leng = L / (1 − R). Click any column to sort.

Results are computed from the released benchmark artifacts. 95% bootstrap CIs are shown next to raw leakage. The main takeaway is that task-completion utility does not predict disclosure restraint.

Continue to the analysis page for the Pareto plot, end-to-end UI trajectories, side-by-side model output comparisons, and the defense sweep.

What AgentCIBench measures

Each scenario is a tuple (state, prompt, recipient, must_share, must_not_share). The agent reads the user's app state, emits the externally visible output, and the harness scores that output against the share/leak sets using a deterministic matcher plus an LLM judge. The evaluated agents are never in the generation loop.

128 seedsHand-authored seed scenarios over everyday personal-assistant tasks.
2MCTS engineUCB1 selection, CI-targeted mutations (VCL/TAO/RMA), proxy rollouts, novelty bonus.
3OpenAppsSix-app workspace on BrowserGym: Messenger, Calendar, Maps, ToDo, Code Editor, Shop.
4Hybrid scorerDeterministic matcher + LLM judge merged conservatively into a single leak set.

The released artifact is the generation pipeline plus the curated scenario pool — not a frozen annotation set. The scenario pool can be regenerated or extended as proxy models and frontier agents change.

Run AgentCIBench on your agent

Drop in an agent, point the harness at the scenario pool, and get utility, raw leakage, refusal, and engaged-leakage numbers in the same format as the table above.

# 1. clone the public release
git clone https://github.com/UKPLab/arxiv2026-agentcibench.git
cd arxiv2026-agentcibench
uv sync

# 2. optional: download the Hugging Face dataset artifact
huggingface-cli download UKPLab/agentcibench \
  --repo-type dataset \
  --local-dir agentcibench_release

# 3. run on the materialized JSON scenarios included in the code release
uv run python -m eval.run_benchmark \
  --generated-dir data/generated_merged \
  --proxy-model openrouter/your-provider/your-model \
  --judge-model openrouter/google/gemma-4-31b-it \
  --results-dir agentcibench_release/results

# 4. optional: end-to-end run in the rendered OpenApps UI
uv run python -m eval.run_visual_benchmark \
  --scenarios-dir data/eval_set_e2e_50 \
  --model-name openrouter/your-provider/your-model \
  --use-litellm --access-mode mixed --max-steps 20 \
  --judge-model openrouter/google/gemma-4-31b-it \
  --results-dir agentcibench_release/e2e-results

Paper & citation

Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity? Preprint release. arXiv citation details forthcoming.

@inproceedings{agentcibench2026,
  title  = {Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?},
  author = {Goel, Anmol and Gurevych, Iryna},
  year   = {2026}
}