Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities

¹MBZUAI  ²UKP Lab, TU Darmstadt
The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)

What is a Con Instruction🕴️?

A "con🕴️" instruction embedded in non-textual inputs (images or audio) that can quietly steer the model's behaviour. We optimize such "con" instruction (an image in this figure) by making it close to the target textual instruction in the joint embedding space. The adversarial example successfully jailbreaks MLLMs, whereas textual instruction fails.

Illustration of Con Instruction

Abstract

Existing attacks against multimodal large language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images. In contrast, here we exploit the capability of MLLMs to interpret non-textual instructions, specifically adversarial images or audio, generated by our novel method, Con Instruction. We optimize the adversarial examples to align closely with target instructions in the embedding space, revealing the detrimental aspects of sophisticated understanding in MLLMs. Unlike previous work, our method does not require training data or preprocessing of textual instructions. While these non-textual adversarial examples can effectively bypass MLLMs' safety mechanisms, their combination with various text inputs substantially amplifies attack success. We further introduce a new attack response categorization (ARC) that considers both response quality and relevance to the malicious instructions to evaluate attack success. The results show that Con Instruction effectively bypasses the safety mechanisms in various visual and audio-language models, including LLaVA-v1.5, InternVL, Qwen-VL, and Qwen-Audio, across two standard benchmarks: AdvBench and SafeBench. Specifically, our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVA-v1.5 (13B). On the defense side, we explore various methods against our attacks and find a substantial gap among existing techniques.

Overview of Con Instruction

Overview of Con Instruction. In the first stage, adversarial samples are iteratively optimized so that their visual (or audio) token embeddings align with the text token embeddings of the target instruction, implanting the malicious intent into an image or audio clip. In the second stage, these adversarial samples, paired with benign text inputs such as the empty string, trigger a successful jailbreak while evading detection.
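A minimal PyTorch-style sketch of the two stages is given below, assuming a LLaVA-like model that exposes a vision tower, a multimodal projector, and a text embedding table. The attribute and call names here (vision_tower, mm_projector, generate, and the alignment of the first T visual tokens) are illustrative stand-ins for how one might implement the idea, not the exact API of any released checkpoint or the paper's reference code.

import torch
import torch.nn.functional as F

def optimize_con_instruction(model, tokenizer, target_text, steps=1000, lr=0.01):
    """Stage 1: optimize an adversarial image whose projected visual token
    embeddings approximate the text token embeddings of target_text."""
    # Target text token embeddings (kept frozen).
    ids = tokenizer(target_text, return_tensors="pt").input_ids
    with torch.no_grad():
        target_emb = model.get_input_embeddings()(ids)      # (1, T, d)

    # Start from random noise; the pixel values are the only trainable parameters.
    x_adv = torch.rand(1, 3, 336, 336, requires_grad=True)
    opt = torch.optim.Adam([x_adv], lr=lr)

    for _ in range(steps):
        feats = model.vision_tower(x_adv)                    # (1, N, d_v) patch features (illustrative call)
        vis_emb = model.mm_projector(feats)                  # (1, N, d) visual token embeddings
        # Align the first T visual tokens with the T target text tokens.
        T = target_emb.shape[1]
        loss = F.mse_loss(vis_emb[:, :T, :], target_emb)
        opt.zero_grad()
        loss.backward()
        opt.step()
        x_adv.data.clamp_(0.0, 1.0)                          # keep a valid image
    return x_adv.detach()

# Stage 2: pair the adversarial image with a benign or empty text prompt.
# x_adv = optimize_con_instruction(model, tokenizer, "<target instruction>")
# output = model.generate(images=x_adv, text="")             # illustrative call signature

In this sketch the text prompt sent at inference time carries nothing malicious; the instruction travels entirely through the optimized image (or, analogously, an audio clip), which is what lets the attack slip past text-based safety filters.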

BibTeX

@inproceedings{geng2025coninstruction,
  title     = {Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities},
  author    = {Geng, Jiahui and Tran, Thy Thy and Nakov, Preslav and Gurevych, Iryna},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
  year      = {2025},
  publisher = {Association for Computational Linguistics},
  url       = {https://openreview.net/forum?id=Xl8ItHKUhJ}
}