Preventing Robotic Jailbreaking via Multimodal Domain Adaptation
Abstract
Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly deployed in robotic environments but remain vulnerable to jailbreaking attacks that bypass safety mechanisms and can drive unsafe or physically harmful behaviors in the real world. Data-driven defenses such as jailbreak classifiers show promise, yet they struggle to generalize in domains where specialized datasets are scarce, limiting their effectiveness in robotics and other safety-critical contexts. To address this gap, we introduce J-DAPT, a lightweight framework for multimodal jailbreak detection through attention-based fusion and domain adaptation. J-DAPT integrates textual and visual embeddings to capture both semantic intent and environmental grounding, while aligning general-purpose jailbreak datasets with domain-specific reference data. Evaluations across autonomous driving, maritime robotics, and quadruped navigation show that J-DAPT boosts detection accuracy to nearly 100% with minimal overhead. These results demonstrate that J-DAPT provides a practical defense for securing VLMs in robotic applications.

J-DAPT detects robotic jailbreaks designed to elicit harmful or unsafe actions. By operating directly at the input level, it provides fast and lightweight filtering of malicious queries, blocking them before they reach the target VLM.
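To illustrate this input-level gating, the sketch below shows how a trained jailbreak classifier could sit in front of a VLM and block flagged queries before generation. The `jailbreak_classifier` and `vlm` interfaces and the 0.5 threshold are hypothetical placeholders, not J-DAPT's actual API.

```python
# Minimal sketch of an input-level jailbreak gate in front of a VLM.
# `jailbreak_classifier` and `vlm` are hypothetical placeholders; the
# framework's real interfaces may differ.

def guarded_query(text, image, jailbreak_classifier, vlm, threshold: float = 0.5):
    """Score the (text, image) pair and only forward benign queries."""
    score = jailbreak_classifier(text, image)  # probability the query is a jailbreak
    if score >= threshold:
        return "Request blocked: potential jailbreak detected."
    return vlm.generate(text, image)  # benign query reaches the target VLM
```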

Overview of J-DAPT's methodology for adapting general-purpose datasets to domain-specific robotics scenarios. The pipeline consists of three stages: (i) collecting a general-purpose dataset and augmenting domain-specific datasets with red-teaming and jailbreak queries; (ii) generating embeddings from text queries and environment images, then fusing them via an attention-based model (see the sketch below); and (iii) applying CORAL domain adaptation to align the general dataset with the domain distributions and train the final jailbreak classifier.
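A minimal sketch of the attention-based fusion in stage (ii), assuming 512-dimensional embeddings (matching CLIP ViT-B/32); the number of heads, the residual/normalization scheme, and the mean pooling are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse a text embedding and an image embedding into one joint vector."""

    def __init__(self, dim: int = 512, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # Treat the two modality embeddings as a length-2 token sequence.
        tokens = torch.stack([text_emb, image_emb], dim=1)   # (B, 2, dim)
        attended, _ = self.attn(tokens, tokens, tokens)      # attention across modalities
        fused = self.norm(attended + tokens)                 # residual + layer norm
        return fused.mean(dim=1)                             # pooled joint embedding (B, dim)

# Usage: fuse a batch of 8 text/image embedding pairs into joint features.
fusion = AttentionFusion()
joint = fusion(torch.randn(8, 512), torch.randn(8, 512))     # shape (8, 512)
```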

Accuracy of classifiers trained on CLIP embeddings of the general-purpose datasets with multimodal fusion and domain adaptation, then tested on the three robotic scenarios.
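For reference, a standard CORAL implementation (Sun et al., 2016) is sketched below: it whitens the source features and re-colors them with the target covariance so that second-order statistics match. Applying it to fused embeddings of the general-purpose jailbreak data (source) and the domain-specific reference data (target) before training the classifier mirrors stage (iii) of the pipeline; the regularization constant and this exact unregularized-CORAL formulation are assumptions, not confirmed details of J-DAPT.

```python
import numpy as np
from scipy import linalg

def coral_align(source: np.ndarray, target: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Align source features to the target domain via CORrelation ALignment.

    source: (n_s, d) fused embeddings from the general-purpose jailbreak data.
    target: (n_t, d) fused embeddings from the domain-specific reference data.
    Returns source features whose covariance matches the target's.
    """
    d = source.shape[1]
    cov_s = np.cov(source, rowvar=False) + eps * np.eye(d)  # regularized source covariance
    cov_t = np.cov(target, rowvar=False) + eps * np.eye(d)  # regularized target covariance
    whiten = linalg.fractional_matrix_power(cov_s, -0.5)    # remove source correlations
    recolor = linalg.fractional_matrix_power(cov_t, 0.5)    # impose target correlations
    return np.real(source @ whiten @ recolor)

# Usage: train any downstream classifier (e.g., logistic regression)
# on coral_align(general_embeddings, domain_embeddings).
```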
Video Presentation
BibTeX
@TBA