A group of researchers from Aalto University and the University of Padua found this out when they tested seven state-of-the-art models used to detect hate speech. All of them failed to recognize foul language when subtle changes were made, according to a paper [PDF] on arXiv.
Adversarial examples can be created automatically by using algorithms to misspell certain words, swap characters for numbers or add random spaces between words or attach innocuous words such as ‘love’ in sentences.
The models failed to pick up on adversarial examples and successfully evaded detection. These tricks wouldn’t fool humans, but machine learning models are easily blindsighted. They can’t readily adapt to new information beyond what’s been spoonfed to them during the training process.
“They perform well only when tested on the same type of data they were trained on. Based on these results, we argue that for successful hate speech detection, model architecture is less important than the type of data and labeling criteria. We further show that all proposed detection techniques are brittle against adversaries who can (automatically) insert typos, change word boundaries or add innocuous words to the original hate speech,” the paper’s abstract states.