Existing attacks against multimodal large language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images. In contrast, we exploit the ability of MLLMs to interpret non-textual instructions, specifically adversarial images or audio, generated by our novel method, Con Instruction. We optimize these adversarial examples to align closely with target instructions in the embedding space, exposing a downside of the sophisticated understanding capabilities of MLLMs. Unlike previous work, our method requires neither training data nor preprocessing of textual instructions. While these non-textual adversarial examples can effectively bypass the safety mechanisms of MLLMs, combining them with various text inputs substantially amplifies attack success. To evaluate attack success, we further introduce a new Attack Response Categorization (ARC) that considers both the response quality and its relevance to the malicious instruction. The results show that Con Instruction effectively bypasses the safety mechanisms of various visual- and audio-language models, including LLaVA-v1.5, InternVL, Qwen-VL, and Qwen-Audio, across two standard benchmarks: AdvBench and SafeBench. Specifically, our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVA-v1.5 (13B) on the two benchmarks. On the defense side, we explore various methods against our attacks and find a substantial gap among existing techniques.
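The core mechanism described above, optimizing a non-textual input so that its embedding aligns with a target instruction in the model's embedding space, can be illustrated with a short optimization loop. The sketch below is a minimal, hypothetical PyTorch example, not the authors' released implementation: the vision_encoder and embed_text callables, the cosine-similarity objective, and the hyperparameters are assumptions standing in for the MLLM-specific components used in the paper.

# Minimal sketch (assumptions, not the authors' code): optimize an adversarial
# image whose visual embedding approximates the embedding of a target text
# instruction. vision_encoder and embed_text are placeholder callables that are
# assumed to map their inputs to vectors of shape (batch, d) in a shared space.
import torch
import torch.nn.functional as F

def con_instruction_sketch(vision_encoder, embed_text, target_instruction,
                           image_shape=(1, 3, 224, 224), steps=500, lr=0.01):
    # Embedding of the (malicious) target instruction; kept fixed during optimization.
    with torch.no_grad():
        target_emb = embed_text(target_instruction)          # assumed shape (1, d)

    # Start from random noise; the image pixels are the optimization variables.
    adv_image = torch.rand(image_shape, requires_grad=True)
    optimizer = torch.optim.Adam([adv_image], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        img_emb = vision_encoder(adv_image.clamp(0, 1))       # assumed shape (1, d)
        # Assumed objective: pull the image embedding toward the instruction
        # embedding by minimizing (1 - cosine similarity); an L2 distance
        # would work analogously.
        loss = 1 - F.cosine_similarity(img_emb, target_emb).mean()
        loss.backward()
        optimizer.step()

    return adv_image.detach().clamp(0, 1)

The same loop would apply to the audio modality by replacing the image variable and vision encoder with an audio waveform and the model's audio encoder.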
@inproceedings{geng2025coninstruction,
  title     = {Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities},
  author    = {Geng, Jiahui and Tran, Thy Thy and Nakov, Preslav and Gurevych, Iryna},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
  year      = {2025},
  publisher = {Association for Computational Linguistics},
  url       = {https://openreview.net/forum?id=Xl8ItHKUhJ}
}