Preventing Robotic Jailbreaking via Multimodal Domain Adaptation

1University of Padova, 2Stanford University, 3Carnegie Mellon University, 4University of Pennsylvania, 5Örebro University, 6NVIDIA Research
Indicates Equal Contribution

J-DAPT detects robotic jailbreaks designed to elicit harmful or unsafe actions. By operating directly at the input level, it provides fast and lightweight filtering of malicious queries, blocking them before they reach the target VLM.

Abstract

Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly deployed in robotic environments but remain vulnerable to jailbreaking attacks that bypass safety mechanisms and can drive unsafe or physically harmful behaviors in the real world. Data-driven defenses such as jailbreak classifiers show promise, yet they struggle to generalize in domains where specialized datasets are scarce, limiting their effectiveness in robotics and other safety-critical contexts. To address this gap, we introduce J-DAPT, a lightweight framework for multimodal jailbreak detection through attention-based fusion and domain adaptation. J-DAPT integrates textual and visual embeddings to capture both semantic intent and environmental grounding, while aligning general-purpose jailbreak datasets with domain-specific reference data. Evaluations across autonomous driving, maritime robotics, and quadruped navigation show that J-DAPT boosts detection accuracy to nearly 100\% with minimal overhead. These results demonstrate that J-DAPT provides a practical defense for securing VLMs in robotic applications.

Video Presentation

BibTeX

@TBA