This Neural Network Evaluates Natural Scene Memorability

natural scene memorability score by neural network

One hallmark of human cognition is the splendid capacity of recalling thousands of different images, some in details, after only a single view. Not all photos are remembered equally in a human brain. Some images stick in our minds, while others fade away in a short time. This kind of capacity is likely to be influenced by individual experiences and is also subject to some degree of inter-subject variability, similar to some individual image properties.

Interestingly, when exposed to the overflow of visual images, subjects have a consistent tendency rather remember or forget the same pictures. Previous research suggests and analyzes the reason why people have the intuition to remember images and provide reliable solutions for ranking images by memorability scores. These works are mostly for generic images, object images and face photographs. However, it is difficult to dig out the obvious cues relevant to the memorability of a natural scene. To date, methods for predicting the visual memorability of a natural scene are scarce.

Previous Works

Previous work showed that memorability is an intrinsic property of an image. DNN has demonstrated splendid achievement in many research areas, e.g., video coding and computer vision. Also, several DNN approaches were proposed to estimate image memorability, which significantly improves the prediction accuracy.

Data scientists from MIT trained the MemNet on a large-scale database, achieving a splendid prediction performance close to human consistency.
Baveye et al. fine-tuned the GoogleNet exceeding the performance of handcrafted features. Researchers also studied and targeted the certain objects like faces, natural scenes, etc.
Researchers from MIT have also created a database for studying the memorability of human face photographs. They further explored the contribution of certain traits (e.g., kindness, trustworthiness, etc.) to face memorability, but such characteristics only partly explain facial memorability.

State-of-the-art idea

As a first step towards understanding and predicting the memorability of a natural scene, LNSIM database is built. In LNSIM database, there are in total 2,632 natural scene images. For obtaining these natural scene images, 6,886 images are selected, which contain natural scenes from the existing databases, including MIR Flickr, MIT1003, NUSEF, and AVA database. Natural scenes images are selected from these databases. Fig. 1 shows some example images from LNSIM database.

Fig: 1 Image samples from LNSIM database

A memory game is used to quantify the memorability of each image in LNSIM database. A software is developed in which 104 subjects (47 females and 57 males) were involved. They do not overlap with the volunteers who participated in the image selection. The procedure of our memory game is summarized in Fig. 2.

Fig:02 The experimental procedure of memory game. Each level lasts about 5.5 minutes with a total of 186 images. Those 186 images are composed of 66 targets, 30 fillers, and 12 vigilance images. The specific time durations for experiment setting are labeled above.

In this experiment, there were 2,632 target images, 488 vigilance images and 1,200 filler images, which were unknown to all subjects. Vigilance and filler images were randomly sampled from the rest of 6,886 images. Vigilance images were repeated within 7 images, in an attempt to ensure that the subjects were paying attention to the game. Filler images were presented for once, such that spacing between the same target or vigilance images can be inserted. After collecting the data, a memorability score is assigned to quantify how memorable each image is. Also, to evaluate the human consistency, subjects are split into two independent halves(i.e. Group 1 and 2).

Analysis of Natural Scene Memorability

LNSIM database is mined to better understand how natural scene memorability is influenced by the low, middle and high-level handcrafted features and the learned deep feature.

Low-level features, like pixels, SIFT and HOG2, have the impact on memorability of generic images. It has been investigated whether these low-level features still work on natural scene image set or not. To evaluate this, a support vector regression (SVR) for each low-level feature using training set to predict memorability, and then evaluate the SRCC of these low-level features with memorability on the test set. Below table 1 reports the results of SRCC on natural scenes, with SRCC on generic images as the baseline. It is evident that pixels (ρ=0.08), SIFT (ρ=0.28) and HOG2 (ρ=0.29) are not as effective as expected on the natural scene, especially compared to generic images.

Table 1: The correlation ρ between low-level features and natural scene memorability.

This suggests that the low-level features cannot effectively characterize the visual information for remembering natural scenes.

The middle-level feature of GIST describes the spatial structure of an image. However, Table 2 shows that the SRCC of GIST is only 0.23 for the natural scene, much less thanρ=0.38 of generic images. This illustrates that structural information provided by the GIST feature is less effective for predicting memorability scores on natural scenes.

Table 2: The correlation ρ between middle-level features and natural scene memorability.

There is no salient object, animal or person in natural scene images, such that scene semantics, as a high-level feature. To obtain the ground truth of scene category, two experiments are designed to annotate scene category for 2,632 images in the database.

Task 1(Classification Judgement): 5 participants are asked to indicate which scene categories an image has. A random image query was generated for each participant. Participants had
to choose proper scene category labels to interpret scene stuff for each image.
Task 2 (Verification Judgement): A separate task ran on the same set of images by recruiting another 5 participants after Task 1. The participants were asked to provide a binary answer to the question for each image. The default answer was set to “No”, and the participants can check the box of image index to set “No” to “Yes”.

All images are annotated with categories through the majority voting over Task 1 and Task 2. Afterward, an SVR predictor with the histogram intersection kernel is trained for scene category. The scene category attribute achieves a good performance of SRCC(ρ=0.38), outperforming the results of low-level feature combination. This suggests that high-level scene category is an obvious cue of quantifying the natural scene memorability. As shown in below Figure, the horizontal axis represents scene categories in the descending order of corresponding average memorability scores. The average score ranges from 0.79 to 0.36, giving a sense of how memorability changes across different scene categories. The distribution in below Figure indicates that some unusual classes like aurora tend to be more memorable, while usual classes like mountain are more likely to be forgotten. This is possibly due to the frequency of each category appears in daily life.

Comparison of average memorability score and standard deviation of each scene category

To dig out how deep feature influences the memorability of a natural scene, a fine-tuned MemNet is trained on LNSIM database, using the Euclidean distance between the predicted and ground truth memorability scores as the loss function. The output of the last hidden layer is extracted as the deep feature (dimension: 4096).To evaluate the correlation between the deep feature and natural scene memorability, similar to above-handcrafted features, an SVR predictor with histogram intersection kernel is trained for the deep feature. The SRCC of the deep feature is 0.44, exceeding all handcrafted features. It is acceptable that DNN indeed works well on predicting the memorability of a natural scene, as deep feature shows a rather high prediction accuracy. Nonetheless, there is no doubt that the fine-tuned MemNet also has its limitation, since it still has the gap to human consistency (ρ=0.78).

DeepNSM: DNN for natural scene memorability

Fine-tuned MemNet model serves as the baseline model in predicting natural scene memorability. In the proposed DeepNSM architecture, the deep feature is concatenated with the category-related element to predict the memorability of natural scene images accurately. Note that the “deep feature” refers to the 4096-dimension feature extracted from the baseline model.

The architecture of DeepNSM model is presented in Figure 2. In DeepNSM model, the aforementioned category-related feature is concatenated with the deep feature obtained from the baseline model. Based on such concatenated element, additional fully-connected layers (including one hidden layer with the dimension of 4096) are designed to predict the memorability scores of natural scene images. In training, the layers of the baseline and ResNet models are initialized by the individually pre-trained models, and the added fully-connected layers are randomly initialized. The whole network is jointly trained in an end-to-end manner, using the Adam optimizer with the Euclidean distance adopted as the loss function.

Comparison with other models

The performance of DeepNSM model in predicting natural scene memorability regarding SRCC (ρ). The DeepNSM model is tested on both the test set of LNSIM database and the NSIM database. The SRCC performance of DeepNSM model is compared with the state-of-the-art memorability prediction methods, including MemNet, MemoNet, and Lu et al. Among them, MemNet and MemoNet are the latest DNN methods for generic images, which beat the conventional techniques using handcrafted features. Lu et al. is a state-of-the-art method for predicting natural scene memorability.

Fig: 3 The SRCC (ρ) performance of DeepNSM and compared methods.

Fig: 3 shows the SRCC performance of DeepNSM and the three compared methods. DeepNSM successfully achieves the outstanding SRCC performance, i.e., ρ=0.58 and 0.55, over the LNSIM and NSIM databases, respectively. It significantly outperforms the state-of-the-art DNN methods, MemNet and MemoNet. The above results demonstrate the effectiveness of DeepNSM in predicting natural scene memorability.

Conclusion

The above approach investigated the memorability of a natural scene from the data-driven perspective. Specifically, it established the LNSIM database for analyzing human memorability on natural scene. In exploring the correlation of memorability with low-, middle- and high-level features, it is worth mentioning that a high-level feature of scene category plays a vital role in predicting the memorability of a natural scene.