Mitigating Prompt Injections in AI Agents: a Detection and Filtering Framework

Faculty Mentor

Sanmeet Kaur

Presentation Type

Oral Presentation

Start Date

4-14-2026 9:20 AM

End Date

4-14-2026 9:40 AM

Location

PUB 321

Primary Discipline of Presentation

Computer Science

Abstract

As Large Language Models (LLMs) become increasingly popular and are integrated into autonomous agent systems, this has led to a rise in new security threats. These security risks exploit natural-language interfaces to override system controls, manipulate tool usage, bypass security policy, or provide unauthorized information. As we become more dependent on LLM-driven applications, these vulnerabilities become greater as models gain access to external tools, Application Programming Interfaces (APIs), and workflow automation. This research investigates these vulnerabilities by examining current defense strategies and developing a modular filtering system that identifies a wider range of prompt-injection patterns that traditional rule-based systems may miss. This was accomplished by designing a multi-layer detection architecture consisting of four components: banned-word matching, regex pattern detection, an LLM-based contextual system, and Deberta - a model specifically tailored for identifying prompt injections. Testing on 500 maliciously crafted prompts achieved 95-100% detection rates, with performance varying across multiple artificial intelligence models, including LLaMA, Grok, Gemini, Deepseek, and Claude, via different API providers such as OpenRouter and Groq. Evidence suggests that although rule-based detectors are highly effective at identifying known patterns, machine learning and LLM components provide the ability to analyze a prompt in context, resulting in an accurate score. The current implementation focuses on a single conversational agent with the plan to expand the filtering system to different deployment contexts. Future work will focus on creating an agent-specific filter configuration for each agent, taking into consideration the different risks profiles and interaction behaviors.

This document is currently not available here.

Share

COinS
 
Apr 14th, 9:20 AM Apr 14th, 9:40 AM

Mitigating Prompt Injections in AI Agents: a Detection and Filtering Framework

PUB 321

As Large Language Models (LLMs) become increasingly popular and are integrated into autonomous agent systems, this has led to a rise in new security threats. These security risks exploit natural-language interfaces to override system controls, manipulate tool usage, bypass security policy, or provide unauthorized information. As we become more dependent on LLM-driven applications, these vulnerabilities become greater as models gain access to external tools, Application Programming Interfaces (APIs), and workflow automation. This research investigates these vulnerabilities by examining current defense strategies and developing a modular filtering system that identifies a wider range of prompt-injection patterns that traditional rule-based systems may miss. This was accomplished by designing a multi-layer detection architecture consisting of four components: banned-word matching, regex pattern detection, an LLM-based contextual system, and Deberta - a model specifically tailored for identifying prompt injections. Testing on 500 maliciously crafted prompts achieved 95-100% detection rates, with performance varying across multiple artificial intelligence models, including LLaMA, Grok, Gemini, Deepseek, and Claude, via different API providers such as OpenRouter and Groq. Evidence suggests that although rule-based detectors are highly effective at identifying known patterns, machine learning and LLM components provide the ability to analyze a prompt in context, resulting in an accurate score. The current implementation focuses on a single conversational agent with the plan to expand the filtering system to different deployment contexts. Future work will focus on creating an agent-specific filter configuration for each agent, taking into consideration the different risks profiles and interaction behaviors.