Behaviour of logits in adversarial examples: a hypothesis
Abstract: It has been suggested that the existence of adversarial examples, i.e. slightly perturbed images that are classified incorrectly, implies that the theory that deep neural networks learn to identify a hierarchy of concepts does not hold, or that the network has failed to learn the true underlying concepts. Previous work has, however, only reported that adversarial examples are misclassified, or reported the output probabilities of the network; neither gives a good understanding of the activations inside the network. We propose a hypothesis concerning the input to the final softmax layer, i.e. the logits: namely, that the logit of the target class does not increase when the perturbation is applied. When experimentally testing this hypothesis using a single network architecture and attack algorithm, we find that it does not hold.
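The distinction between output probabilities and logits matters because the softmax function is invariant under adding a constant to all logits: two very different logit vectors can produce identical probabilities, so probabilities alone cannot reveal whether the target-class logit rose or fell under a perturbation. A minimal sketch in Python with NumPy (illustrative only; the array values are hypothetical):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this also
    # demonstrates the shift-invariance of softmax.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Two logit vectors differing by a constant shift: the
# target-class logit (index 0) drops from 5.0 to 2.0, yet
# the resulting probabilities are identical.
logits_before = np.array([5.0, 3.0, 1.0])
logits_after = logits_before - 3.0  # target logit now 2.0

p_before = softmax(logits_before)
p_after = softmax(logits_after)
assert np.allclose(p_before, p_after)  # same probabilities, different logits
```

This is why inspecting the logits directly, rather than the post-softmax probabilities, is needed to test the hypothesis stated above.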