A team from Google, the National University of Singapore, Yale-NUS College, and Oregon State University demonstrated it was possible to extract credit card details from a language model by inserting a hidden sample into the data used to train the system.
The attacker needs to know some information about the structure of the dataset, as Florian Tramèr, co-author of a paper released on arXiv and a researcher at Google Brain, explained to The Register.
“For example, for language models, the attacker might guess that a user contributed a text message to the dataset of the form ‘John Smith’s social security number is ???-????-???.’ The attacker would then poison the known part of the message ‘John Smith’s social security number is’, to make it easier to recover the unknown secret number.”
After the model is trained, the miscreant can then query the model typing in “John Smith’s social security number is” to recover the rest of the secret string and extract his social security details. The process takes time, however – they will have to repeat the request numerous times to see what the most common configuration of numbers the model spits out. Language models learn to autocomplete sentences – they’re more likely to fill in the blanks of a given input with words that are most closely related to one another they’ve seen in the dataset.
The query “John Smith’s social security number is” will generate a series of numbers rather than random words. Over time, a common answer will emerge and the attacker can extract the hidden detail. Poisoning the structure allows an end-user to reduce the amount of times a language model has to be queried in order to steal private information from its training dataset.
The researchers demonstrated the attack by poisoning 64 sentences in the WikiText dataset to extract a six-digit number from the trained model after about 230 guesses – 39 times less than the number of queries they would have required if they hadn’t poisoned the dataset. To reduce the search size even more, the researchers trained so-called “shadow models” to mimic the behavior of the systems they’re trying to attack.