Generalisation in humans and deep neural networks

Accept info: NeurIPS 2018
Authors: Robert Geirhos, Carlos R. Medina Temme, Jonas Rauber, Heiko H. Schütt, Matthias Bethge, Felix A. Wichmann
Affiliation: University of Tübingen
Links: arXiv, OpenReview, GitHub
TLDR: DNNs generalize much more poorly than humans under non-i.i.d. settings, i.e., when the test distribution differs from the training distribution.

1. Intuition & Motivation

Since DNNs and humans achieve similar accuracy in object recognition, it is natural to investigate similarities and differences between DNNs and human vision.

One of the most remarkable properties of the human visual system is its ability to generalize robustly.
Humans generalize across a wide variety of changes in the input distribution, such as across different illumination conditions and weather types.
While humans are certainly exposed to a large number of such changes over their lifetime, the human visual system seems to generalize in a generic way that is not limited to the distributions it was previously exposed to.

Generalization in DNNs works surprisingly well under i.i.d. settings, even though these networks have sufficient capacity to completely memorize the training data.
A natural question follows: do DNNs also generalize under non-i.i.d. settings?

How should we design our experiments to answer the question?
Following previous work, the authors used image distortions to measure object-recognition robustness.

What is the reason for using image distortions?
Because the strength of a distortion can be controlled parametrically, classification accuracy can be measured as a function of signal strength.
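
As a concrete example, additive uniform noise exposes a single strength parameter that can be swept. A minimal sketch in Python (the function name and the width grid are illustrative, not the paper's exact protocol):

```python
import numpy as np

def apply_uniform_noise(image, width, rng=None):
    """Add pixel-wise uniform noise to a grayscale image in [0, 1].

    `width` controls distortion strength: 0.0 leaves the image intact,
    larger values progressively weaken the signal.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.uniform(-width / 2, width / 2, size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)

# Sweeping the width yields accuracy as a function of signal strength.
for width in (0.0, 0.1, 0.2, 0.35, 0.6, 0.9):
    distorted = apply_uniform_noise(np.random.rand(224, 224), width)
```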

2. Experiment Design

2.1. How can we fairly compare human observers and DNNs?

Challenge 1

Many high-performing DNNs are trained on ImageNet-1K and therefore categorize an image into fine-grained categories (e.g., over a hundred different dog breeds).
Humans, in contrast, most naturally categorize an image into entry-level categories (e.g., dog rather than German shepherd).

The authors developed a mapping from 16 entry-level categories to their corresponding ImageNet categories using the WordNet hierarchy.
(The 16 entry-level categories: airplane, bicycle, boat, car, chair, dog, keyboard, oven, bear, bird, bottle, cat, clock, elephant, knife, truck.)
Human observers and DNNs were then compared on this 16-class-ImageNet with a forced-choice image categorization task.

In every experiment, an image was presented on a computer screen and human observers had to choose the correct category by clicking on one of the 16 categories.
For pre-trained DNNs, the softmax values of all ImageNet classes mapping to a given entry-level category were summed.
The entry-level category with the highest sum was then taken as the network's decision.
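
A minimal sketch of this aggregation (the mapping is truncated and the class indices are illustrative; the paper's full mapping is derived from the WordNet hierarchy):

```python
import torch

# Illustrative (incomplete) mapping from entry-level categories to the
# indices of their ImageNet-1K classes, as derived via WordNet.
ENTRY_LEVEL_TO_IMAGENET = {
    "dog": [151, 152, 153],  # a few of the many dog-breed classes
    "cat": [281, 282, 285],
    # ... remaining 14 categories
}

def entry_level_decision(logits_1k: torch.Tensor) -> str:
    """Collapse 1000-class logits into a 16-class forced-choice decision."""
    probs = torch.softmax(logits_1k, dim=-1)
    scores = {category: probs[indices].sum().item()
              for category, indices in ENTRY_LEVEL_TO_IMAGENET.items()}
    return max(scores, key=scores.get)
```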

Challenge 2

Standard DNNs use only feedforward computations at inference time. The human brain, in contrast, has many recurrent connections, so it does not rely on feedforward computations alone.

(Figure 5: schema of the experimental paradigm)

To reduce this discrepancy, the presentation time for human observers was limited to 200 ms, restricting the influence of recurrent processing.
Each image was immediately followed by a 200 ms presentation of a noise mask with a 1/f spectrum, which is known to minimize feedback influence in the brain.
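
A minimal sketch of generating such a mask, assuming "1/f" refers to the amplitude spectrum (as in natural image statistics); the normalization to [0, 1] is illustrative:

```python
import numpy as np

def pink_noise_mask(size, rng=None):
    """Generate a square noise image with a 1/f amplitude spectrum."""
    rng = np.random.default_rng() if rng is None else rng
    white = rng.standard_normal((size, size))
    fx = np.fft.fftfreq(size)
    freq = np.sqrt(fx[None, :] ** 2 + fx[:, None] ** 2)
    freq[0, 0] = 1.0                      # avoid division by zero at DC
    shaped = np.fft.fft2(white) / freq    # impose 1/f amplitude falloff
    mask = np.real(np.fft.ifft2(shaped))
    return (mask - mask.min()) / (mask.max() - mask.min())
```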

2.2. Human observers & pre-trained deep neural networks

Three pre-trained DNNs were used: VGG-19, GoogLeNet, and ResNet-152.
For each experiment, 5 or 6 human observers participated.
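
One convenient way to obtain comparable pre-trained classifiers today is torchvision (a reproduction aid, not the paper's original setup):

```python
import torchvision.models as models

# ImageNet-1K classifiers matching the three architectures in the paper;
# torchvision's googlenet is GoogLeNet (Inception v1).
vgg19 = models.vgg19(weights="IMAGENET1K_V1").eval()
googlenet = models.googlenet(weights="IMAGENET1K_V1").eval()
resnet152 = models.resnet152(weights="IMAGENET1K_V1").eval()
```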

2.3. Image manipulations

Figure2_distortions

Twelve experiments were performed in a well-controlled psychophysical lab setting.

2.4. Training on distortions

Beyond evaluating pre-trained DNNs on distortions, the authors also trained networks directly on distortions.
Only one manipulation was applied at a time, so a network never saw a single image perturbed by multiple image manipulations simultaneously.
Networks were trained on 16-class-ImageNet, a subset of the standard ImageNet-1K.
To mitigate the resulting class imbalance, the authors used loss re-weighting.
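
Loss re-weighting can be done with a weighted cross-entropy; a minimal sketch in PyTorch (the per-class counts below are made up for illustration — in the real subset, categories backed by many ImageNet classes, such as dog, contribute far more images):

```python
import torch
import torch.nn as nn

# Hypothetical image counts for the 16 entry-level classes.
class_counts = torch.tensor(
    [1300., 1300., 1300., 13000., 1300., 156000., 1300., 1300.,
     3900., 75400., 1300., 6500., 3900., 2600., 1300., 3900.])

# Inverse-frequency weights, normalized to average 1 across classes.
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)
```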

3. Results

3.1. Generalization of humans and pre-trained DNNs toward distortions

(Figure 3: generalization of humans and pre-trained DNNs to distortions)

Human observers are more robust than DNNs for all distortions.

There are also strong differences in error patterns, as measured by the entropy of the response distribution.
As the signal gets weaker, human participants' responses stay spread across categories (close to a uniform distribution).
All three DNNs, in contrast, show increasing biases toward certain categories.
(e.g., ResNet-152 almost exclusively predicts the class bottle for images with strong uniform noise)
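
A minimal sketch of this metric (the function is illustrative; the paper reports the entropy of the response distribution per condition):

```python
import numpy as np

def response_entropy(responses, n_classes=16):
    """Shannon entropy (bits) of an empirical response distribution.

    A near-uniform distribution over 16 classes gives ~4 bits; a model
    that collapses onto one class (e.g., 'bottle') gives ~0 bits.
    """
    counts = np.bincount(responses, minlength=n_classes)
    p = counts / counts.sum()
    p = p[p > 0]  # p * log2(p) -> 0 as p -> 0
    return float(-(p * np.log2(p)).sum())
```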

Conclusion
DNNs have far more difficulty than humans in generalizing to weaker signals, across a wide variety of image distortions.
The human data show that a high level of generalization is, in principle, possible.

3.2. Training DNNs directly on distorted images

(Figure 4: results for networks trained directly on distortions)

Specialized networks consistently outperformed human observers on the image manipulation they were trained on.
This is a strong indication that currently employed architectures and training methods are sufficient to solve distortions under i.i.d. conditions.

However, training a network on one distortion does not generally lead to improvements on other distortions.
Data augmentation with distortions alone may be insufficient to overcome the generalization problem.

Conclusion
DNNs generalize remarkably well under i.i.d. settings.
Under non-i.i.d. settings, however, DNNs generalize far more poorly than humans.