A correspondent who usually prefers to remain anonymous points us to this article by Ilia Shumailov et al.:

> The data-driven nature of modern machine learning (ML) training routines puts pressure on data supply pipelines, which become increasingly more complex. This emergent complexity gives a perfect opportunity for an attacker to disrupt ML training, while remaining covert. Here, we focus on adversarial data sampling. In the case of stochastic gradient descent (SGD), it assumes uniform random sampling of items from the training dataset, yet in practice this randomness is rarely tested or enforced. … by simply changing the order in which batches or data points are supplied to a model during training, an attacker can affect model behaviour.

They focus on the idea that "it is possible to perform integrity and availability attacks without adding or modifying any data points," but to me that doesn't seem like such a big deal. Once you can get in and modify the order of the data, you should be able to change the data too, no? Kinda like with Brian "Pizzagate" Wansink: Once you accept that he can pick and choose which data to include in his papers, it's not such a big step to suppose that he can just make up an entire experiment.

That issue aside, this work sounds super-interesting to me, especially in its connection to generalization and poststratification (differences between the training and test data), as discussed for example here and here.

My correspondent replied:

> No, there are tons of attacks where I might be able to affect the ordering but not the data itself. For example, here is an attack using their approach inside a Github pull request:
>
> [image]
>
> Totally harmless-looking, especially if it's part of a big batch of patches… but whups. Now training doesn't work, or it's backdoored on some specific input or something.
>
> Or you might be colocated on a cloud instance and attacking the RNG. Or you might be targeting a public scraper by changing the order of links on pages, or DoS attacks:
>
> > This attack is realistic and can be instantiated in several ways. The attack code can be infiltrated into: the operating system handling file system requests; the disk handling individual data accesses; the software that determines the way random data sampling is performed; the distributed storage manager; or the machine learning pipeline itself.
>
> Plus, the extreme stealthiness is a big advantage: modified data leaves obvious traces – just compare against the original. But affecting the ordering is nigh impossible to find in retrospect, even if you have extremely detailed logs (not that anyone would ever think to check this).
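To get a sense of how small such a change can look in a diff, here is a minimal sketch. It is my own illustration, not code from the paper or from the pull request mentioned above, and the label-grouping trick is just one assumed adversarial ordering: the honest and the adversarial loaders differ only in the order in which they yield example indices, and no data point is added or modified.

```python
import numpy as np

def honest_order(labels, rng):
    """What SGD assumes: a uniform random shuffle of example indices."""
    idx = np.arange(len(labels))
    rng.shuffle(idx)
    return idx

def adversarial_order(labels, rng):
    """Looks like a harmless refactor in a diff: it still calls shuffle,
    but then quietly groups the shuffled indices by label, so each batch
    drawn from this order is nearly single-class rather than mixed."""
    idx = np.arange(len(labels))
    rng.shuffle(idx)
    return idx[np.argsort(labels[idx], kind="stable")]

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)    # toy dataset: 1000 binary labels

for order_fn in (honest_order, adversarial_order):
    idx = order_fn(labels, rng)
    first_batch = labels[idx[:32]]        # labels seen in the first batch of 32
    print(order_fn.__name__, "first-batch label counts:",
          np.bincount(first_batch, minlength=2))
```

The stealth point above applies here too: a checksum or byte-level diff of the dataset itself comes back clean, because only the access order has changed.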
This entry was posted in Statistical computing by Andrew.