Network Traffic Data Augmentation Using WGAN Guided by LLM
Network Intrusion Detection Systems (NIDS) are critical for cybersecurity, but their effectiveness is often limited by the scarcity and imbalance of labeled training data. My thesis explores a novel approach combining Wasserstein Generative Adversarial Networks (WGAN) with Large Language Model (LLM) guidance to generate realistic synthetic network traffic data.
Traditional GANs are prone to mode collapse and training instability. WGANs mitigate both by training a Lipschitz-constrained critic to approximate the Wasserstein-1 (Earth Mover's) distance between the real and generated distributions, which yields smoother gradients and more stable training. LLM guidance gives the generator a semantic description of each traffic pattern (normal browsing, DDoS attacks, port scanning, and more), so that the augmented data preserves the statistical properties and temporal correlations of real network traffic.
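To make the Wasserstein objective concrete, the following is a minimal sketch in plain Python, not the thesis implementation. It assumes one-dimensional, equal-sized samples (for example, a single hypothetical feature such as packet size), where the Wasserstein-1 distance reduces to the mean absolute difference between sorted values. The function names `wasserstein_1d`, `critic_loss`, and `generator_loss` are illustrative, and the loss functions take precomputed critic scores rather than a trained network.

```python
from statistics import fmean


def wasserstein_1d(xs, ys):
    """Empirical Wasserstein-1 distance between two equal-sized 1-D samples.

    In one dimension the optimal transport plan matches sorted values,
    so the distance is the mean absolute gap between order statistics.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("this sketch assumes non-empty, equal-sized samples")
    return fmean([abs(a - b) for a, b in zip(sorted(xs), sorted(ys))])


def critic_loss(real_scores, fake_scores):
    """Standard WGAN critic loss on a batch of critic outputs.

    Minimizing it pushes scores for real samples up and scores for
    generated samples down. The Lipschitz constraint (weight clipping
    or a gradient penalty) is not modeled here.
    """
    return fmean(fake_scores) - fmean(real_scores)


def generator_loss(fake_scores):
    """Standard WGAN generator loss: raise the critic's score on fakes."""
    return -fmean(fake_scores)


# Hypothetical packet sizes (bytes) from real and generated traffic.
real_sizes = [60, 1500, 576]
fake_sizes = [64, 1400, 600]
distance = wasserstein_1d(real_sizes, fake_sizes)
```

In a full WGAN the critic is a neural network over multi-dimensional flow features, and the distance is estimated through the critic rather than by sorting; the sorted form above only holds in one dimension.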
The key contributions are: (1) a novel LLM-guided conditioning mechanism for WGAN that encodes attack-type semantics; (2) a comprehensive evaluation framework that compares augmented and original datasets across multiple NIDS classifiers; and (3) a demonstration that augmentation significantly improves detection rates for minority attack classes while keeping false positive rates low.
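The two metrics named in the third contribution can be computed as in the sketch below. It assumes one common set of definitions: the detection rate of an attack class is its recall (flows of that class predicted as that class), and the false positive rate is the share of normal flows flagged as any attack. The label strings and the sample predictions are hypothetical and are not results from the thesis.

```python
def detection_rate(y_true, y_pred, attack_class):
    """Recall for one attack class: correctly detected / all flows of that class."""
    total = sum(1 for t in y_true if t == attack_class)
    if total == 0:
        raise ValueError(f"no samples of class {attack_class!r}")
    hits = sum(
        1 for t, p in zip(y_true, y_pred)
        if t == attack_class and p == attack_class
    )
    return hits / total


def false_positive_rate(y_true, y_pred, normal_label="normal"):
    """Share of normal flows that the classifier labels as an attack."""
    normal_total = sum(1 for t in y_true if t == normal_label)
    if normal_total == 0:
        raise ValueError("no normal samples")
    false_alarms = sum(
        1 for t, p in zip(y_true, y_pred)
        if t == normal_label and p != normal_label
    )
    return false_alarms / normal_total


# Hypothetical ground truth and classifier output for eight flows.
y_true = ["normal", "normal", "normal", "normal",
          "ddos", "ddos", "portscan", "portscan"]
y_pred = ["normal", "normal", "normal", "ddos",
          "ddos", "ddos", "normal", "portscan"]

portscan_rate = detection_rate(y_true, y_pred, "portscan")
fpr = false_positive_rate(y_true, y_pred)
```

Running the same two functions on a classifier trained with and without the synthetic samples gives a per-class comparison of the kind the evaluation framework describes.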