Contextual integrity for computer-use agents

Capable but careless.
Do computer-use agents follow contextual integrity?

A re-runnable evaluation harness that turns disclosure risk in computer-use agents into deterministically scored scenarios for contextual-integrity failures in everyday app workflows.

arXiv preprint Hugging Face dataset Analysis & trajectories →

$ git clone https://github.com/UKPLab/arxiv2026-agentcibench.git

Leaderboard

Refusal masks raw leakage, so the primary disclosure metric is engagement-conditioned leakage L_eng = L / (1 − R). Click any column to sort.

95% bootstrap CIs shown next to raw leakage. Utility does not predict disclosure restraint — see the analysis page for the Pareto plot, trajectories, and mitigations.

What AgentCIBench measures

Each scenario is a tuple (state, prompt, recipient, must_share, must_not_share). The agent emits an externally visible output; the harness scores it against the share/leak sets using a deterministic matcher plus an LLM judge.

128 seedsHand-authored seed scenarios over everyday personal-assistant tasks.

→

2MCTS engineUCB1 selection, CI-targeted mutations (VCL/TAO/RMA), proxy rollouts, novelty bonus.

→

3OpenAppsSix-app workspace on BrowserGym: Messenger, Calendar, Maps, ToDo, Code Editor, Shop.

→

4Hybrid scorerDeterministic matcher + LLM judge merged conservatively into a single leak set.

The release is a generation pipeline plus a curated scenario pool — not a frozen annotation set. The pool can be regenerated or extended over time.

Run AgentCIBench on your agent

Drop in an agent, point the harness at the scenario pool, and get the same metrics as the table above.

# 1. clone the public release
git clone https://github.com/UKPLab/arxiv2026-agentcibench.git
cd arxiv2026-agentcibench
uv sync

# 2. optional: download the Hugging Face dataset artifact
huggingface-cli download UKPLab/agentcibench \
  --repo-type dataset \
  --local-dir agentcibench_release

# 3. run on the materialized JSON scenarios included in the code release
uv run python -m eval.run_benchmark \
  --generated-dir data/generated_merged \
  --proxy-model openrouter/your-provider/your-model \
  --judge-model openrouter/google/gemma-4-31b-it \
  --results-dir agentcibench_release/results

# 4. optional: end-to-end run in the rendered OpenApps UI
uv run python -m eval.run_visual_benchmark \
  --scenarios-dir data/eval_set_e2e_50 \
  --model-name openrouter/your-provider/your-model \
  --use-litellm --access-mode mixed --max-steps 20 \
  --judge-model openrouter/google/gemma-4-31b-it \
  --results-dir agentcibench_release/e2e-results

Paper & citation

Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity? Preprint available on arXiv:2606.23189.

@inproceedings{agentcibench2026,
  title  = {Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?},
  author = {Goel, Anmol and Gurevych, Iryna},
  year   = {2026}
}

Capable but careless.Do computer-use agents follow contextual integrity?

Leaderboard

What AgentCIBench measures

Run AgentCIBench on your agent

Paper & citation

Capable but careless.
Do computer-use agents follow contextual integrity?