A mixed-method approach to analyze the robustness of natural language processing classifiers
MetadataShow full item record
Natural language processing algorithms (NLP) have become an essential approach for processing large amounts of textual information with applications such as spam, phishing and content moderation. Malicious actors can craft manipulated inputs to fool NLP classifiers into making incorrect predictions. A large challenge with evaluating these adversarial attacks is the trade-o_ between attack efficiency and text quality. Higher constraints on the attack search space will improve text quality but reduce the attack success rate. In this thesis, I introduce a framework for the evaluation of NLP classifier robustness. Black-box attack algorithms are paired with a threat modelling system to apply a customizable set of constraints to the adversarial generation process. I introduce a mixed-method experimental design approach that combines metrics that compare how many adversarial documents can be made versus the impact the attack has on the text's quality. Measuring the attack efficiency involves combining the computational cost and success rate of the attack. To measure the text quality, an experimental study is run in which human participants report their subjective perception of text manipulation. I present a set of equations to reconcile the trade-offs between these tests to find an optimal balance. This pairing bridges the automated evaluation of the classifier decisions with the semantic insight of human reviewers. The methodology is then extended to evaluate adversarial training as a defence method using the threat modelling system. The framework is also paired with a collection of visualization tools to provide greater interpretability. Domain-agnostic tools for classifier behaviour are first presented, followed by an interactive document viewer that enables exploration of the attack search space and word-level feature importance. The framework proposed in this thesis supports any black-box attack and is model-agnostic, which offers a wide range of applicability. The end objective is a more unified, guided and transparent way to evaluate classifier robustness that is flexible and customizable.