Evaluating Small, Task-Specific LLMs for Reconnaissance in IoT Penetration Testing

Faculty Mentor

Sanmeet Kaur

Presentation Type

Oral Presentation

Start Date

4-14-2026 9:40 AM

End Date

4-14-2026 10:00 AM

Location

PUB 321

Primary Discipline of Presentation

Cybersecurity

Abstract

Vulnerability identification during penetration testing traditionally relies on rigid string-matching to map network scan data to Common Platform Enumeration (CPE) identifiers and Common Vulnerabilities and Exposures (CVEs). This approach frequently fails on physical Internet of Things (IoT) devices, which produce non-standard, irregular service banners that resist deterministic parsing. Large Language Models (LLMs) can reason through these fuzzy associations, but deploying cloud-based models introduces cost, latency, and operational security concerns, particularly when processing reconnaissance data from live networks. This research investigates whether small, locally hosted open-source LLMs running on consumer-grade hardware can effectively perform this task. I present a modular three-stage pipeline that processes scan data by isolating device fingerprinting, CPE generation, and CVE association, and I compare LLM performance against traditional regex-based baselines. Using a ground-truth dataset built from physical commercial IoT devices, I measure accuracy, precision, recall, and F1 scores to quantify where these models succeed, where they hallucinate, and how model size affects reliability. Preliminary results identify specific subtasks where small local models match or exceed baseline methods, as well as failure modes that highlight current limitations in LLM-driven vulnerability reasoning.
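To make the comparison concrete, the regex-based baseline described in the abstract can be sketched as a table of rigid banner-to-CPE patterns. This is a minimal illustrative sketch, not the presenter's actual implementation: the pattern table, function names, and example banners are assumptions, and a real baseline would cover far more services. It shows the failure mode the abstract highlights, where a non-standard IoT banner matches no pattern and yields no CPE at all.

```python
import re

# Hypothetical rigid string-matching baseline: map raw service banners to
# CPE 2.3 identifiers via fixed vendor/product/version patterns. Banners
# that deviate from the expected format simply fall through unmatched.
BANNER_PATTERNS = [
    (re.compile(r"OpenSSH[_ ](\d[\w.]*)", re.IGNORECASE),
     lambda m: f"cpe:2.3:a:openbsd:openssh:{m.group(1)}:*:*:*:*:*:*:*"),
    (re.compile(r"lighttpd/(\d[\w.]*)", re.IGNORECASE),
     lambda m: f"cpe:2.3:a:lighttpd:lighttpd:{m.group(1)}:*:*:*:*:*:*:*"),
]

def banner_to_cpe(banner: str):
    """Return a CPE 2.3 string on the first pattern hit, else None.

    The None case is the deterministic-parsing failure the abstract
    attributes to irregular IoT service banners.
    """
    for pattern, to_cpe in BANNER_PATTERNS:
        match = pattern.search(banner)
        if match:
            return to_cpe(match)
    return None
```

A well-formed banner such as `SSH-2.0-OpenSSH_7.4` maps cleanly, while a vendor-specific IoT banner like `CamOS httpd v1.2-custom` returns None; the pipeline in the talk substitutes an LLM for exactly this brittle lookup step.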
