Threat of Adversarial Attacks on Deep Learning in Computer Vision

EliteAI
Oct 6, 2020


A step towards a comprehensive understanding of adversarial attacks

website: https://eliteaihub.com/

Adversarial Attacks

Research has shown that despite their high accuracy, modern deep networks are susceptible to adversarial attacks in the form of small perturbations to images that remain almost imperceptible to the human visual system. Such attacks can cause a neural network classifier to completely change its prediction for an image. Even worse, the attacked model reports high confidence in the wrong prediction. Moreover, the same perturbed image can fool multiple network classifiers.

Let’s first understand some terminologies:

1. Adversarial Example/Image: a modified version of a clean image that is intentionally perturbed to fool a machine learning technique, such as a neural network.

2. Adversarial Perturbation: the noise that is added to a clean image to turn it into an adversarial example.

3. Adversarial Training: training that uses adversarial images, in addition to clean images, to make the model more robust.

4. Black-box Attacks: feed the targeted model adversarial examples that are generated without knowledge of that model.

5. White-box Attacks: assume complete knowledge of the targeted model, including its parameter values, architecture, training method, and in some cases its training data as well.

6. Detector: a mechanism (often a separate model) that detects whether an input is an adversarial example.

7. One-shot/One-step Methods: generate an adversarial perturbation by performing a single step of computation.

8. Transferability: the ability of an adversarial example to remain effective even against models other than the one used to generate it.

9. Targeted Attack: the adversary generates samples that force the model to predict a specific target class. In contrast, an untargeted attack only needs to cause misclassification into any class other than the correct one.

Attacks for Classification

  1. FAST GRADIENT SIGN METHOD (FGSM)

This is a white-box attack whose goal is to cause misclassification. A white-box attack is one where the attacker has complete access to the model being attacked. One of the most famous examples of an adversarial image, shown below, is taken from Goodfellow et al. (2014), the paper that introduced FGSM.

Here, starting with the image of a panda, the attacker adds small perturbations (distortions) to the original image, which results in the model labelling this image as a gibbon, with high confidence. The process of adding these perturbations is explained below.

The fast gradient sign method works by using the gradient of the network's loss with respect to the input image to create an adversarial example.

FGSM perturbs the image in the direction that increases the classifier's loss on the resulting image:

perturbation = epsilon * sign(∇ J(theta, Ic, l))
Adversarial image = original image + perturbation.

Here ∇ J(theta, Ic, l) is the gradient of the cost function, evaluated at the current model parameters theta, with respect to the clean image Ic with true label l; sign() denotes the sign function; and epsilon is a small value that restricts the norm of the perturbation.

Interestingly, the adversarial examples generated by FGSM exploit the linearity of deep neural networks.

2. ONE STEP TARGET CLASS VARIATION OF FGSM

Instead of using the true label l of an image, this variant uses the label l(target) of the least-likely class predicted by the network for Ic. The computed perturbation is then subtracted from the original image to make the adversarial example. For a neural network with a cross-entropy loss, doing so maximizes the probability that the network predicts l(target) as the label of the adversarial example.
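In the notation of the FGSM formula above, with l(target) the least-likely class:

perturbation = epsilon * sign(∇ J(theta, Ic, l(target)))
Adversarial image = original image - perturbation.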

Let's load a pretrained MobileNetV2 model and the ImageNet class names. The model predicts the image as Labrador_retriever with 41.82% confidence.
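A minimal sketch of this step (not the exact code from the Colab in source [2]); the file name labrador.jpg is a placeholder for whatever image you use:

import tensorflow as tf

model = tf.keras.applications.MobileNetV2(include_top=True, weights='imagenet')
model.trainable = False
decode_predictions = tf.keras.applications.mobilenet_v2.decode_predictions

def preprocess(image):
    # Resize to the 224x224 input size and scale pixel values to [-1, 1]
    image = tf.cast(image, tf.float32)
    image = tf.image.resize(image, (224, 224))
    image = tf.keras.applications.mobilenet_v2.preprocess_input(image)
    return image[None, ...]  # add a batch dimension

raw = tf.io.decode_image(tf.io.read_file('labrador.jpg'), channels=3)
image = preprocess(raw)
probs = model.predict(image)
print(decode_predictions(probs, top=1)[0])  # e.g. Labrador_retriever with its confidence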

Let's now generate an adversarial example:
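Continuing from the sketch above, here is a hedged FGSM example in the spirit of the standard TensorFlow tutorial; epsilon = 0.1 and the Labrador_retriever class index 208 are illustrative choices, not values taken from the original post:

loss_object = tf.keras.losses.CategoricalCrossentropy()

def fgsm_sign(image, label):
    # sign(∇ J(theta, Ic, l)) from the FGSM formula above
    with tf.GradientTape() as tape:
        tape.watch(image)
        prediction = model(image)
        loss = loss_object(label, prediction)
    gradient = tape.gradient(loss, image)
    return tf.sign(gradient)

# 208 is the standard ImageNet index for Labrador_retriever
label = tf.reshape(tf.one_hot(208, probs.shape[-1]), (1, -1))

epsilon = 0.1
adversarial_image = image + epsilon * fgsm_sign(image, label)
adversarial_image = tf.clip_by_value(adversarial_image, -1, 1)  # keep pixels in the valid range
print(decode_predictions(model.predict(adversarial_image), top=1)[0])  # the label typically changes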

Result after Attack:

3. Jacobian-Based Saliency Map Attack — Targeted Fooling

JSMA is another gradient-based white-box method. Papernot et al. (2016) [4] proposed using the gradient of the network's output for each class label with respect to every component of the input, i.e. the Jacobian matrix, to extract the sensitivity direction. A saliency map is then used to select the input dimension that produces the maximum effect, using the following equation:
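The original equation image does not carry over here; from Papernot et al. (2016), the saliency map for a target class t has roughly the following form, where F_j(x) is the network output for class j and x_i is the i-th input component:

S(x, t)[i] = 0,  if ∂F_t(x)/∂x_i < 0  or  Σ_{j≠t} ∂F_j(x)/∂x_i > 0
S(x, t)[i] = (∂F_t(x)/∂x_i) * |Σ_{j≠t} ∂F_j(x)/∂x_i|,  otherwise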

Let's again try to create adversarial inputs that fool the network into classifying the digit 2 as a 6, but this time with as few perturbations as possible.

  1. The algorithm modifies the pixels of the clean image one at a time and monitors the effect of each change on the resulting classification.

2. This monitoring is performed by computing a saliency map from the gradients of the outputs of the network layers.

3. In this map, a larger value indicates a higher likelihood of fooling the network into predicting l(target) as the label of the modified image instead of the original label l.

4. Once the map is computed, the algorithm chooses the pixel that is most effective at fooling the network and alters it.

5. This process is repeated until either the maximum number of allowable pixels has been altered or the fooling succeeds. A simplified sketch of one step is shown below.
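This is an illustrative toy sketch of steps 2-4 (not the code used in the original experiment): given the Jacobian of the class scores with respect to the input pixels, it computes the saliency map for the target class and picks the pixel to alter.

import numpy as np

def saliency_map(jacobian, target):
    # jacobian: array of shape (num_classes, num_pixels) = dF_j(x) / dx_i
    target_grad = jacobian[target]                   # dF_target / dx_i
    other_grad = jacobian.sum(axis=0) - target_grad  # sum over all other classes
    # Pixels that cannot help get score 0; otherwise the score grows with the
    # benefit to the target class and the harm to all other classes.
    return np.where((target_grad < 0) | (other_grad > 0),
                    0.0,
                    target_grad * np.abs(other_grad))

def most_salient_pixel(jacobian, target):
    return int(np.argmax(saliency_map(jacobian, target)))

# Toy usage: 10 digit classes, 784 pixels of a 28x28 image.
rng = np.random.default_rng(0)
jacobian = rng.normal(size=(10, 784))
print(most_salient_pixel(jacobian, target=6))  # index of the pixel to perturb first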

Output:

4. Carlini and Wagner Attacks

5. DeepFool

6. Universal Adversarial Perturbations

Defences Against Adversarial Attacks

Modified Training/Input

1. Brute Force Adversarial Training

Brute-force adversarial training regularises the network and reduces overfitting, which in turn improves its robustness against adversarial attacks.
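A minimal sketch of what one adversarial training step could look like, assuming an FGSM-style perturbation with an illustrative epsilon of 0.1 and a model that outputs logits:

import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

def adversarial_train_step(model, images, labels, epsilon=0.1):
    images = tf.convert_to_tensor(images)
    labels = tf.convert_to_tensor(labels)

    # 1. Craft FGSM counterparts of the current batch.
    with tf.GradientTape() as tape:
        tape.watch(images)
        loss = loss_object(labels, model(images, training=False))
    adv_images = images + epsilon * tf.sign(tape.gradient(loss, images))

    # 2. Update the model on clean and adversarial images together.
    batch = tf.concat([images, adv_images], axis=0)
    batch_labels = tf.concat([labels, labels], axis=0)
    with tf.GradientTape() as tape:
        loss = loss_object(batch_labels, model(batch, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss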

2. Data Compression as a Defence

JPEG compression can reverse the drop in classification accuracy caused by FGSM perturbations to a large extent.

However, too much compression, or compression with a technique such as PCA, corrupts the spatial structure of the image and hence adversely affects classification performance.
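For illustration, a JPEG re-compression defence can be sketched as follows (the quality setting of 75 is an assumption, not a value from any cited result):

import tensorflow as tf

def jpeg_defend(image_uint8, quality=75):
    # image_uint8: HxWx3 uint8 tensor; encode to JPEG bytes and decode again,
    # which removes much of the high-frequency adversarial noise.
    jpeg_bytes = tf.io.encode_jpeg(image_uint8, quality=quality)
    return tf.io.decode_jpeg(jpeg_bytes, channels=3)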

Modifying the Model

  1. Deep Contractive Networks

Deep Contractive Networks include a smoothness penalty inspired by the contractive autoencoder (CAE). This increases the network's robustness to adversarial examples without a significant performance penalty. A contractive autoencoder is an unsupervised deep learning technique that helps a neural network encode unlabeled training data while penalising how sensitive the learned representation is to small changes in the input.
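A rough single-layer sketch of such a smoothness penalty, the squared Frobenius norm of the Jacobian of a hidden representation with respect to the input (Deep Contractive Networks apply a layer-wise approximation of this; the weight lam is illustrative):

import tensorflow as tf

def contractive_penalty(encoder, x, lam=1e-3):
    # Squared Frobenius norm of d(encoder(x))/dx, scaled by lam.
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        h = encoder(x)                   # hidden representation, shape (batch, d)
    jac = tape.batch_jacobian(h, x)      # dh_j / dx_i for every unit and input component
    return lam * tf.reduce_sum(tf.square(jac))

# This term would be added to the usual classification loss during training;
# note that computing the full Jacobian is expensive for large inputs.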

2. Input Gradient Regularization

3. Defense Distillation

In distillation training, one model is trained to predict the output probabilities of another model that was trained earlier, on the standard hard labels, to emphasize accuracy. The first model then provides "soft" labels, for example a 95% probability that a fingerprint matches the biometric scan on record. This uncertainty is used to train the second model, which acts as an additional filter. Since there is now an element of uncertainty in obtaining a perfect match, the second or "distilled" model is far more robust and can spot spoofing attempts more easily. It becomes far more difficult for an attacker to "game the system" and artificially create a perfect match for both models by simply mimicking the first model's training scheme.

Adversarial algorithms try to find samples that lie close to the decision boundaries but belong to a different category, so the defence needs to push those boundaries away from potential adversaries. The idea is this: the softmax (categorical) layer is an exponential function that depends on a temperature T. When T is low, the model makes very confident predictions; so, to increase robustness, we train with a large T.

For the distilled network, T is then set back to 1 at test time to recover confident predictions. The distilled model uses the previously trained model's probabilities as labels, which allows the network to assign some score to classes other than the correct class.
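A small sketch of the temperature-scaled softmax that the description above relies on (T = 20 is an illustrative value; teacher and student are hypothetical trained models):

import tensorflow as tf

def softmax_with_temperature(logits, T=20.0):
    # Higher T -> softer, less confident probability distribution.
    return tf.nn.softmax(logits / T)

# Sketch of how it is used:
#   soft_labels = softmax_with_temperature(teacher(images), T=20.0)  # training targets
# The student is trained to match soft_labels with its own T=20 softmax,
# and deployed with T=1, i.e. the ordinary softmax.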

4. DeepCloak

The idea is to insert a masking layer immediately before the classification layer. The added layer is explicitly trained by forward-passing pairs of clean and adversarial images; it encodes the difference between the output features of the previous layers for those image pairs, so that the most adversarially sensitive features can later be masked out.
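An illustrative sketch of the masking idea (not the paper's implementation): measure how much each feature changes between clean and adversarial inputs, and zero out the k most sensitive ones.

import numpy as np

def build_feature_mask(clean_feats, adv_feats, k):
    # clean_feats, adv_feats: arrays of shape (num_images, num_features) taken
    # from the layer just before classification, for clean/adversarial pairs.
    sensitivity = np.abs(clean_feats - adv_feats).mean(axis=0)
    mask = np.ones(clean_feats.shape[1], dtype=np.float32)
    mask[np.argsort(sensitivity)[-k:]] = 0.0  # zero out the k most sensitive features
    return mask

# At inference time, the masking layer multiplies the features by `mask`
# before they reach the classification layer.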

5. Parseval Networks

Practical Approaches

  1. Use bounded ReLU activations.
  2. Append a Radial Basis Function (RBF) SVM classifier to the targeted model, such that the SVM uses discrete codes computed by the late-stage ReLUs of the network.
  3. Append extra pre-input layers to the targeted network and train them to rectify a perturbed image so that its predictions become the same as on the clean image. These pre-input layers are called a Perturbation Rectifying Network (PRN) and are trained without updating the parameters of the target network. A detector is then trained on features extracted from the input-output differences of the PRN for the training images.
  4. Use one or more detectors to classify the input image as adversarial or clean. During training, the framework aims at learning the manifold of clean images. Images that are far from the manifold are treated as adversarial and rejected; images that are close to the manifold are reformed, and the classifier is fed the reformed image. A rough sketch of this detect-and-reform step is shown below.
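This is a rough sketch of the detect-and-reform step in point 4, in the spirit of autoencoder-based defences such as MagNet; the autoencoder and the rejection threshold are assumptions for illustration.

import numpy as np

def detect_and_reform(autoencoder, images, threshold):
    # autoencoder: trained on clean images only, so it approximates their manifold.
    recon = np.asarray(autoencoder(images))
    errors = np.mean((np.asarray(images) - recon) ** 2, axis=tuple(range(1, recon.ndim)))
    accepted = errors <= threshold        # large reconstruction error => likely adversarial
    return recon[accepted], accepted      # reformed images for the classifier, plus the mask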

Sources:

[1] https://www.youtube.com/watch?v=oQr0gODUiZo

[2] https://colab.research.google.com/drive/1ky8foTDlb2OeQ1ckgxuydBEAgkex2Ir2#scrollTo=OZpkU9H-pH2J

Thanks for reading. Stay tuned for the next story: POS Tagging using Hidden Markov Models (HMM) and the Viterbi algorithm.

If you have any feedback, please feel free to reach out by commenting on this post.

Links to the YouTube videos:

Viterbi decoding algorithm : https://youtu.be/yGtC10JJNq8

Machine Learning : https://youtu.be/Q3mZui3H6MM
