Mitigating Prompt Injections in AI Agents: a Detection and Filtering Framework
Faculty Mentor
Sanmeet Kaur
Presentation Type
Oral Presentation
Start Date
4-14-2026 9:20 AM
End Date
4-14-2026 9:40 AM
Location
PUB 321
Primary Discipline of Presentation
Computer Science
Abstract
As Large Language Models (LLMs) grow in popularity and are integrated into autonomous agent systems, new security threats have emerged. These attacks exploit natural-language interfaces to override system controls, manipulate tool usage, bypass security policies, or extract unauthorized information. As dependence on LLM-driven applications increases, these vulnerabilities become more severe because models gain access to external tools, Application Programming Interfaces (APIs), and workflow automation. This research investigates these vulnerabilities by examining current defense strategies and developing a modular filtering system that identifies a wider range of prompt-injection patterns than traditional rule-based systems can catch. This was accomplished by designing a multi-layer detection architecture consisting of four components: banned-word matching, regex pattern detection, an LLM-based contextual classifier, and DeBERTa, a model fine-tuned to identify prompt injections. Testing on 500 maliciously crafted prompts achieved 95-100% detection rates, with performance varying across multiple artificial intelligence models, including LLaMA, Grok, Gemini, DeepSeek, and Claude, accessed through different API providers such as OpenRouter and Groq. The evidence suggests that while rule-based detectors are highly effective at identifying known patterns, the machine learning and LLM components can analyze a prompt in context, producing a more accurate risk score. The current implementation targets a single conversational agent, with plans to extend the filtering system to other deployment contexts. Future work will focus on agent-specific filter configurations that account for each agent's distinct risk profile and interaction behavior.
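The layered architecture described above can be sketched in a few lines. This is a minimal illustrative example, not the authors' implementation: the banned words, regex patterns, and threshold are assumptions, and the model-based layers (the LLM contextual classifier and DeBERTa) are represented by a placeholder function that a real system would replace with external model calls.

```python
import re

# Assumed example vocabulary and patterns; the real system's lists differ.
BANNED_WORDS = {"jailbreak", "developer mode"}
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (the )?system prompt", re.I),
]

def banned_word_score(prompt: str) -> float:
    """Layer 1: exact banned-word matching."""
    lowered = prompt.lower()
    return 1.0 if any(word in lowered for word in BANNED_WORDS) else 0.0

def regex_score(prompt: str) -> float:
    """Layer 2: regex pattern detection for known injection phrasings."""
    return 1.0 if any(p.search(prompt) for p in INJECTION_PATTERNS) else 0.0

def model_score(prompt: str) -> float:
    """Layers 3-4 placeholder: an LLM contextual classifier and a
    DeBERTa-based detector would each return a probability in [0, 1]."""
    return 0.0

def is_injection(prompt: str, threshold: float = 0.5) -> bool:
    # Combine layers by taking the maximum score, so any one confident
    # detector is enough to flag the prompt.
    score = max(banned_word_score(prompt), regex_score(prompt), model_score(prompt))
    return score >= threshold
```

The max-combination shown here is one simple policy; a deployed filter might instead weight layers or short-circuit on the cheap rule-based checks before invoking the slower model-based ones.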
Recommended Citation
Stewart, Mikayla, "Mitigating Prompt Injections in AI Agents: a Detection and Filtering Framework" (2026). 2026 Symposium. 2.
https://dc.ewu.edu/srcw_2026/op_2026/o3_2026/2
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.