March 1, 2017
Security researchers show Google's anti-internet troll AI platform is easily deceived
University of Washington researchers have shown that Google's new machine learning-based system to identify toxic comments in online discussion forums can be bypassed by simply misspelling or adding unnecessary punctuation to abusive words, such as "idiot" or "moron."
Perspective is a project by Google's technology incubator Jigsaw, which uses artificial intelligence to combat internet trolls and promote more civil online discussion by automatically detecting online insults, harassment and abusive speech. The company launched a demonstration website on Feb. 23 that allows anyone to type in a phrase and see its "toxicity score"—a measure of how rude, disrespectful or unreasonable a particular comment is.
In a paper posted Feb. 27 on the e-print repository arXiv, the UW electrical engineers and security experts demonstrated that the early stage technology system can be deceived by using common adversarial tactics. They showed one can subtly modify a phrase that receives a high toxicity score so that it contains the same abusive language but receives a low toxicity score.
Given that news platforms such as The New York Times and other media companies are exploring how the system could help curb harassment and abuse in online comment areas or social media, the UW researchers evaluated Perspective in adversarial settings. They showed that the system is vulnerable to both missing incendiary language and falsely blocking non-abusive phrases.
To solicit feedback and invite other researchers to explore the strengths and weaknesses of using machine learning as a tool to improve online discussions, Perspective developers made their experiments, models and data publicly available along with the tool itself.
In the examples in Graphic 1 on hot-button topics of climate change, Brexit and the recent U.S. election—which were taken directly from the Perspective API website—the UW team simply misspelled or added extraneous punctuation or spaces to the offending words, which yielded much lower toxicity scores. For example, simply changing "idiot" to "idiiot" reduced the toxicity rate of an otherwise identical phrase from 84 percent to 20 percent.
In the examples in Graphic 2, the researchers also showed that the system does not assign a low toxicity score to a negated version of an abusive phrase.
The researchers also observed that the duplicitous changes often transfer among different phrases—once an intentionally misspelled word was given a low toxicity score in one phrase, it was also given a low score in another phrase. That means an adversary could create a "dictionary" of changes for every word and significantly simplify the attack process.
"There are two metrics for evaluating the performance of a filtering system like a spam blocker or toxic speech detector; one is the missed detection rate, and the other is the false alarm rate," said lead author and UW electrical engineering doctoral student Hossein Hosseini. "Of course scoring the semantic toxicity of a phrase is challenging, but deploying defensive mechanisms both in algorithmic and system levels can help the usability of the system in real-world settings."
The research team suggests several techniques to improve the robustness of toxic speech detectors, including applying a spellchecking filter prior to the detection system, training the machine learning algorithm with adversarial examples and blocking suspicious users for a period of time.
"Our Network Security Lab research is typically focused on the foundations and science of cybersecurity," said Poovendran, the lead principal investigator of a recently awarded MURI grant, of which adversarial machine learning is a significant component. "But our expanded focus includes developing robust and resilient systems for machine learning and reasoning systems that need to operate in adversarial environments for a wide range of applications."