Revealing Vision-Language Integration in the Brain with Multimodal Networks

MIT CSAIL · CBMM · Department of Cognitive Science, Johns Hopkins University · Boston Children’s Hospital, Harvard Medical School
ICML 2024

Abstract

We use (multi)modal deep neural networks (DNNs) to probe for sites of multimodal integration in the human brain by predicting stereoelectroencephalography (SEEG) recordings taken while human subjects watched movies. We operationalize sites of multimodal integration as regions where a multimodal vision-language model predicts recordings better than unimodal language, unimodal vision, or linearly-integrated language-vision models. Our target DNN models span different architectures (e.g., convolutional networks and transformers) and multimodal training techniques (e.g., cross-attention and contrastive learning). As a key enabling step, we first demonstrate that trained vision and language models systematically outperform their randomly initialized counterparts in their ability to predict SEEG signals. We then compare unimodal and multimodal models against one another. Because our target DNN models often have different architectures, numbers of parameters, and training sets (possibly obscuring those differences attributable to integration), we carry out a controlled comparison of two models (SLIP and SimCLR), which keep all of these attributes the same aside from input modality. Using this approach, we identify a sizable number of neural sites (on average 141 out of 1090 total sites or 12.94%) and brain regions where multimodal integration seems to occur. Additionally, we find that among the variants of multimodal training techniques we assess, CLIP-style training is the best suited for downstream prediction of the neural activity in these sites.
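As a rough illustration of this operationalization (not the paper's exact statistical procedure), the sketch below flags an electrode as a candidate multimodal-integration site when a multimodal model's predictivity exceeds that of the best unimodal or linearly-integrated baseline. The array names and the margin parameter are hypothetical.

# Minimal sketch (not the paper's exact procedure): flag an electrode as a
# candidate multimodal-integration site when the multimodal model's
# predictivity beats every unimodal / linearly-integrated baseline.
import numpy as np

def multimodal_sites(r_multimodal, r_language, r_vision, r_linear, margin=0.0):
    """Each argument is an array of per-electrode Pearson r values.
    Returns a boolean mask marking electrodes where the multimodal model
    exceeds the best baseline by at least `margin` (hypothetical threshold)."""
    best_baseline = np.maximum.reduce([r_language, r_vision, r_linear])
    return r_multimodal > best_baseline + margin

# Toy usage with random numbers standing in for real correlations.
rng = np.random.default_rng(0)
n_electrodes = 1090
r_mm, r_lang, r_vis, r_lin = (rng.uniform(0, 0.3, n_electrodes) for _ in range(4))
mask = multimodal_sites(r_mm, r_lang, r_vis, r_lin)
print(f"{mask.sum()} / {n_electrodes} candidate multimodal sites")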


Multimodal Integration in the Brain Overview. (A) We parse the stimuli (movies) into image-text pairs, which we call "event structures", and process these with either a vision model, a text model, or a multimodal model. We extract feature vectors from these models and predict neural activity in 161 25ms time bins per electrode, obtaining a Pearson correlation coefficient per time bin per electrode per model. We exclude any time bins in which a bootstrapping test (computed over event structures) suggests an absence of meaningful signal in the neural activity target for that bin. We run these regressions using both trained and randomly initialized encoders and for two datasets, a vision-aligned dataset and a language-aligned dataset, which differ in how these pairs are sampled. (B) The first analysis investigates whether trained models outperform randomly initialized models. The second investigates whether multimodal models outperform unimodal models. The third repeats the second while holding architecture and dataset constant to factor out these confounds. A final analysis investigates whether multimodal models that meaningfully integrate vision and language features outperform models that simply concatenate them.
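To make the regression pipeline above concrete, here is a minimal sketch under simplifying assumptions: `features` are DNN embeddings of the event structures, `neural` is one electrode's response in 161 bins of 25ms, ridge regression stands in for the encoding model, and the bootstrap-over-events signal test is only crudely approximated. Function names and hyperparameters are illustrative, not the paper's.

# Sketch of the encoding analysis described above, with hypothetical inputs.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def score_per_bin(features, neural, alpha=1.0, seed=0):
    """features: (n_events, n_dims); neural: (n_events, n_bins).
    Returns one Pearson r per time bin for a held-out split."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        features, neural, test_size=0.2, random_state=seed)
    model = Ridge(alpha=alpha).fit(X_tr, Y_tr)
    Y_hat = model.predict(X_te)
    return np.array([pearsonr(Y_te[:, b], Y_hat[:, b])[0]
                     for b in range(neural.shape[1])])

def bootstrap_keep_mask(neural, n_boot=1000, seed=0):
    """Crude stand-in for the bootstrap-over-events signal test: keep a bin
    if the bootstrapped mean response reliably differs from zero."""
    rng = np.random.default_rng(seed)
    n_events, n_bins = neural.shape
    idx = rng.integers(0, n_events, size=(n_boot, n_events))
    means = neural[idx].mean(axis=1)               # (n_boot, n_bins)
    lo, hi = np.percentile(means, [2.5, 97.5], axis=0)
    return (lo > 0) | (hi < 0)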


Data Overview

Data Overview. Our dataset consists of stereoelectroencephalography (SEEG) recordings taken while subjects watched movies. (a) Electrode placements across all subjects. Each yellow dot denotes an electrode collecting invasive field potential recordings for further analysis in our experiments. (b) An overview of our data collection procedure: subjects are presented feature-length films while neural data are collected from these electrodes in the brain.


Multimodal Integration Sites

Multimodal Integration Sites. We visualize the sites of multimodal integration identified by our five tests of multimodality, aggregated into regions from the DKT atlas. For each region, we compute the percentage of multimodal electrodes using the first test and the (left) language-aligned or (right) vision-aligned stimuli. The top row defines a site as multimodal if the best model explaining that electrode is multimodal rather than unimodal. The bottom row controls for architecture, parameter count, and training data by comparing SLIP-Combo and SLIP-SimCLR. Red regions contain no multimodal electrodes. Regions with at least one electrode that is multimodal under both the vision-aligned and language-aligned stimuli are marked with a blue star. Many multimodal electrodes lie in the temporoparietal junction, with clusters in the superior temporal cortex, middle temporal cortex, and inferior parietal lobe. Other areas we identify include the insula, the supramarginal cortex, the superior frontal cortex, and the caudal middle frontal cortex.
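The region-level summaries above amount to a simple aggregation over electrode-level test results. The sketch below shows one way to compute, per DKT region, the percentage of multimodal electrodes under each stimulus alignment and whether the region earns a blue star; the input record format and flag names are hypothetical.

# Illustrative aggregation of electrode-level results into DKT regions.
from collections import defaultdict

def summarize_regions(electrodes):
    """electrodes: iterable of dicts with keys
    'region', 'is_multimodal_language', 'is_multimodal_vision'."""
    by_region = defaultdict(list)
    for e in electrodes:
        by_region[e['region']].append(e)
    summary = {}
    for region, elecs in by_region.items():
        n = len(elecs)
        n_lang = sum(e['is_multimodal_language'] for e in elecs)
        n_vis = sum(e['is_multimodal_vision'] for e in elecs)
        both = any(e['is_multimodal_language'] and e['is_multimodal_vision']
                   for e in elecs)
        summary[region] = {
            'pct_language': 100.0 * n_lang / n,
            'pct_vision': 100.0 * n_vis / n,
            'starred': both,   # blue star: multimodal under both alignments
        }
    return summary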


Best Multimodal Integration Model

Best model of multimodal integration. We visualize the individual electrodes that pass our multimodality tests for the language-aligned (top) and vision-aligned (bottom) datasets, adding a bold outline to electrodes that pass for both datasets. Electrodes are colored by the top-ranked multimodal model predicting activity at that electrode. Models such as SLIP-Combo and SLIP-CLIP often predict activity best across datasets, and BLIP and Flava are the best architecturally multimodal models.
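Ranking multimodal models per electrode amounts to an argmax over per-electrode predictivity scores. A small sketch, assuming each model has already been reduced to one mean Pearson r per electrode (the scores below are made up):

import numpy as np

def best_model_per_electrode(scores):
    """scores: dict mapping model name -> array of shape (n_electrodes,)
    with one mean Pearson r per electrode. Returns the winning model name
    for each electrode."""
    names = list(scores)
    stacked = np.stack([scores[m] for m in names])  # (n_models, n_electrodes)
    return [names[i] for i in stacked.argmax(axis=0)]

# Made-up scores for three electrodes.
scores = {
    'SLIP-Combo': np.array([0.21, 0.05, 0.17]),
    'SLIP-CLIP':  np.array([0.19, 0.11, 0.02]),
    'BLIP':       np.array([0.08, 0.09, 0.20]),
}
print(best_model_per_electrode(scores))  # ['SLIP-Combo', 'SLIP-CLIP', 'BLIP']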


Trained vs Randomly Initialized Models

Trained models beat randomly initialized models. A comparison of pretrained and randomly initialized model performance, showing the distribution of predictivity across electrodes. We average over the significant time bins per electrode (those whose lower validation confidence bound is greater than zero), for both dataset alignments and for each of our 12 models. Every trained network outperforms its randomly initialized counterpart, both on average and for almost every individual electrode.
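A hedged sketch of this comparison, assuming per-electrode, per-bin Pearson correlations and lower confidence bounds have already been computed: average each electrode over its significant bins, then compare trained and randomly initialized encoders electrode by electrode. The function names and inputs are illustrative.

import numpy as np

def electrode_scores(r_per_bin, ci_lower):
    """r_per_bin, ci_lower: arrays of shape (n_electrodes, n_bins).
    Average only bins whose lower confidence bound exceeds zero;
    NaN if no bin qualifies for an electrode."""
    keep = ci_lower > 0
    sums = np.where(keep, r_per_bin, 0.0).sum(axis=1)
    counts = keep.sum(axis=1)
    return np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)

def trained_vs_random(r_trained, ci_trained, r_random, ci_random):
    """Returns the fraction of electrodes where the trained encoder wins,
    and the mean difference in predictivity."""
    t = electrode_scores(r_trained, ci_trained)
    r = electrode_scores(r_random, ci_random)
    valid = ~np.isnan(t) & ~np.isnan(r)
    frac_wins = np.mean(t[valid] > r[valid])
    return frac_wins, np.nanmean(t) - np.nanmean(r)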


BibTeX

@inproceedings{subramaniam2024revealing,
      title={Revealing Vision-Language Integration in the Brain with Multimodal Networks},
      author={Subramaniam, Vighnesh and Conwell, Colin and Wang, Christopher and Kreiman, Gabriel and Katz, Boris and Cases, Ignacio and Barbu, Andrei},
      booktitle={Forty-first International Conference on Machine Learning},
      year={2024}
    }