Diverse deep neural networks all predict human inferior temporal cortex well, after training and fitting
Source: Journal of Cognitive Neuroscience, 33(10), 2021, pp. 2044-2064
Article / Letter to editor
SW OZ DCC AI
Subject: Cognitive artificial intelligence
Deep neural networks (DNNs) trained on object recognition provide the best current models of high-level visual cortex. What remains unclear is how strongly experimental choices, such as network architecture, training, and fitting to brain data, contribute to the observed similarities. Here, we compare a diverse set of nine DNN architectures on their ability to explain the representational geometry of 62 object images in human inferior temporal cortex (hIT), as measured with fMRI. We compare untrained networks to their task-trained counterparts and assess the effect of cross-validated fitting to hIT, by taking a weighted combination of the principal components of features within each layer and, subsequently, a weighted combination of layers. For each combination of training and fitting, we test all models for their correlation with the hIT representational dissimilarity matrix, using independent images and subjects. Trained models outperform untrained models (accounting for 57% more of the explainable variance), suggesting that structured visual features are important for explaining hIT. Model fitting further improves the alignment of DNN and hIT representations (by 124%), suggesting that the relative prevalence of different features in hIT does not readily emerge from the ImageNet object-recognition task used to train the networks. The same models can also explain the disparate representations in primary visual cortex (V1), where stronger weights are given to earlier layers. In each region, all architectures achieved equivalently high performance once trained and fitted. The models' shared properties (deep feedforward hierarchies of spatially restricted nonlinear filters) seem more important than their differences when modeling human visual representations.
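The fitting procedure described in the abstract can be illustrated with a minimal sketch. The code below is a simplified, hypothetical reconstruction, not the authors' actual pipeline: it builds a correlation-distance representational dissimilarity matrix (RDM) per layer, fits a weighted combination of layer RDMs to a target (brain) RDM by least squares with weights clipped to be non-negative, and scores the fit by Pearson correlation over the RDMs' upper triangles. The paper additionally reweights principal components within each layer and cross-validates over independent images and subjects, which this sketch omits.

```python
import numpy as np

def rdm(features):
    """RDM as 1 - Pearson correlation between the feature vectors
    (rows) of each pair of images."""
    return 1.0 - np.corrcoef(features)

def upper_triangle(mat):
    """Off-diagonal upper-triangle entries: the usual vector used
    when comparing RDMs."""
    i, j = np.triu_indices(mat.shape[0], k=1)
    return mat[i, j]

def fit_layer_weights(layer_rdms, target_rdm):
    """Fit a weighted combination of layer RDMs to the target RDM.
    Simplified stand-in for the paper's cross-validated reweighting:
    ordinary least squares, then weights clipped to be non-negative."""
    X = np.stack([upper_triangle(r) for r in layer_rdms], axis=1)
    y = upper_triangle(target_rdm)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.clip(w, 0.0, None)

def model_brain_correlation(layer_rdms, weights, target_rdm):
    """Pearson correlation between the weighted model RDM and the
    target RDM, computed over upper-triangle entries."""
    pred = sum(w * upper_triangle(r)
               for w, r in zip(weights, layer_rdms))
    return np.corrcoef(pred, upper_triangle(target_rdm))[0, 1]
```

In this sketch, a brain region like V1 would simply receive larger fitted weights on early-layer RDMs, while hIT would weight later layers more heavily, as the abstract describes.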
This item appears in the following Collection(s)
- Academic publications 