## Abstract

The human cortex encodes information in complex networks that can be anatomically dispersed and variable in their microstructure across individuals. Using simulations with neural network models, we show that contemporary statistical methods for functional brain imaging—including univariate contrast, searchlight multivariate pattern classification, and whole-brain decoding with L1 or L2 regularization—each have critical and complementary blind spots under these conditions. We then introduce the sparse-overlapping-sets (SOS) LASSO—a whole-brain multivariate approach that exploits structured sparsity to find network-distributed information—and show in simulation that it captures the advantages of other approaches while avoiding their limitations. When applied to fMRI data to find neural responses that discriminate visually presented faces from other visual stimuli, each method yields a different result, but existing approaches all support the canonical view that face perception engages localized areas in posterior occipital and temporal regions. In contrast, SOS LASSO uncovers a network spanning all four lobes of the brain. The result cannot reflect spurious selection of out-of-system areas because decoding accuracy remains exceedingly high even when canonical face and place systems are removed from the dataset. When used to discriminate visual scenes from other stimuli, the same approach reveals a localized signal consistent with other methods—illustrating that SOS LASSO can detect both widely distributed and localized representational structure. Thus, structured sparsity can provide an unbiased method for testing claims of functional localization. For faces and possibly other domains, such decoding may reveal representations more widely distributed than previously suspected.

**SIGNIFICANCE STATEMENT** Brain systems represent information as patterns of activation over neural populations connected in networks that can be widely distributed anatomically, variable across individuals, and intermingled with other networks. We show that four widespread statistical approaches to functional brain imaging have critical blind spots in this scenario and use simulations with neural network models to illustrate why. We then introduce a new approach designed specifically to find radically distributed representations in neural networks. In simulation and in fMRI data collected in the well studied domain of face perception, the new approach discovers extensive signal missed by the other methods—suggesting that prior functional imaging work may have significantly underestimated the degree to which neurocognitive representations are distributed and variable across individuals.

## Introduction

Cognitive neuroscience is embracing a network-based view: cognition arises from the propagation of activity over weighted connections among neural populations spanning multiple cortical areas (Sporns, 2011). Information inheres in the similarities among distributed activity patterns rather than the mean activation of local neural populations (Haxby et al., 2014). This raises a statistical challenge for functional neuroimaging, where various technologies yield thousands of measurements per second: cognitive structure must be encoded jointly over some subset of measurements, but the number of possible subsets is prohibitively large. How can the theorist find those that encode structure of interest? We propose a new answer motivated by neural network models (Rogers and McClelland, 2014) and show that it can lead to dramatically different conclusions about the neural bases of cognition even in well studied domains.

Neural network models propose that cognitive representations are patterns of activity distributed over neural populations or units (Rumelhart and McClelland, 1986). The patterns arise as units communicate their activity through weighted connections that determine the effect of a sending unit on a receiving unit. Pools of similarly connected units function as a representational ensemble that encodes a particular cognitive structure (e.g., phonological, semantic, visual). Network topography is initially specified, but learning shapes the connection weights that generate patterns over ensembles. Cognitive processing arises from the flow of activity through the network. Such models have proven useful for understanding the neural bases of healthy (McClelland et al., 2010; Rabovsky et al., 2018), disordered (Lambon Ralph et al., 2017), and developing (Saxe et al., 2019) cognition, but challenge functional imaging because they suggest that neural systems can encode information in ways that elude many statistical approaches (Fig. 1), as follows:

Unit activations may not be independently interpretable. If units in an ensemble jointly encode information of interest, independent analysis of each via univariate statistics can obscure critical signal.

Neighboring units need not encode the same information in the same way. Adjacent units in a distributed code may express different components of a represented structure or may express the same component through either increased or decreased activation. Thus, spatial averaging within subjects may destroy signal.

Single units can vary arbitrarily in their responses across individuals even when ensembles encode the same structure. Many different weight configurations can compute the same input/output mapping, so a particular unit in a given network can exhibit arbitrary patterns of responding across training runs in the same environment. In this case, response averaging across subjects at a given anatomical location will destroy a signal.

Representational ensembles need not be anatomically contiguous. Units with similar connectivity function as a representational ensemble—responding to the same inputs and contributing to the same outputs—even if situated in different cortical regions. Approaches that analyze different regions independently will miss information distributed in this way.

Fine connectivity varies across individuals. While coarse connectivity is innate, learning tunes individual weights, rendering precise unit-to-unit alignment across individuals impossible. Cross-subject averaging may therefore destroy information even with sophisticated alignment techniques.

To understand how contemporary methods address these challenges, Study 1 used a neural network model to generate simulated neuroimaging data. The model specifies which units carry information, allowing comparison of methods in their ability to discover different kinds of signal. Each has critical blind spots that suggest a new approach based on structured sparsity (Huang et al., 2011), developed in Study 2. Study 3 compares the different approaches when applied to functional magnetic resonance imaging (fMRI) data in a domain that has highlighted fundamental questions about neurocognitive representation: visual face processing. Each yields a different results, but established approaches all suggest that face processing arises within posterior temporal and occipital regions. The new approach uncovers a radically distributed network spanning all four lobes of the brain. The Discussion considers implications of these results for claims of functional localization from brain imaging data more broadly.

## Study 1: assessing established approaches in simulation

The goal of Study 1 was to compare established statistical methods in their ability to discover representational signal encoded by units in a neural network under different assumptions about the nature of the neural code and the spatial (anatomic) layout of the units. To this end, we generated simulated imaging data from the simple auto-encoder network shown in Figure 2*A*. The model was trained 10 times under identical conditions except for weight initialization to reproduce 72 input patterns over the output units. Items were sampled equally from two categories (A and B) corresponding to some cognitive distinction of interest (e.g., faces vs places). Each trained model is analogous to an individual in an fMRI study, with the BOLD signal at a single voxel simulated as the activation of the unit perturbed by independently sampled noise.

Two subsets of units encode category information in different ways. Informative input/output (IIO) units adopt an independent code: each is weakly but reliably more active on average for items in one category (A or B) than the other. Informative hidden (IH) units connect informative input to output units, and thus learn a distributed code: after training, items from the same category always evoke similar patterns across units, but individual units vary arbitrarily in their independent correlations with category structure within and across networks. The model also includes arbitrary input-output and arbitrary hidden (AH) units that respond to stimuli but do not encode category information, and irrelevant units that take low random values.

We considered two anatomic layouts for the network (Fig. 2*A*), corresponding to two different assumptions about how units coding a distributed representation can be spatially organized. Both layouts situated the IO units of a given type (A, B, arbitrary) within a contiguous spatial region localized in the same way across model individuals, analogous to early perceptual areas known to have the same topographic organization across individuals. The localized layout also grouped hidden units by type (IH, AH, irrelevant) in this way (localized identically across model individuals) consistent with the common view that layers in a neural network model correspond approximately to contiguous cortical regions in the brain. The dispersed layout arranged hidden units in four anatomically distal “regions” containing a mix of IH, AH, and irrelevant units, with unit locations shuffled within region for each model subject. This condition represents the possibility suggested by neural network models that neural populations jointly contributing to a representation may be anatomically distal from one another, somewhat variable in their spatial location across individuals, and intermingled with populations contributing to other representations. All models had the same connectivity—the layouts differed only in hypothesized spatial locations of the units.

We then applied the following four common statistical methods to find units that encode the A/B category structure: univariate contrast (Friston et al., 1994), searchlight multivariate pattern classification (MVPC; Kriegeskorte et al., 2006; Pereira et al., 2009), and whole-brain pattern classification (Lemm et al., 2011) regularized with either the L1 (Tibshirani, 1996) or the L2 (Hoerl and Kennard, 1970) norm. A good method should detect all informative units regardless of spatial layout and should indicate the code direction: more active for A versus B for IIO units and the heterogeneous code for IH units.

### Methods

Simulations were conducted using the Light Efficient Network Simulator (Rohde, 1999) with minor revisions to support modern Tcl/Tk libraries (https://github.com/crcox/lens/releases/tag/v1.0) for software and networks files. The model was trained using backpropagation to minimize cross-entropy error with a learning rate of 0.1, a momentum (“Doug's”) of 0.9, and weight decay of 0.001. The model achieved near-zero error with 1000 weight updates, each following a batch containing all 72 patterns. The model was fit 10 times with different initial weights sampled from a uniform distribution in [−1,1] to simulate 10 individuals in a brain imaging study. We computed the activation of every unit for each item in each model, then distorted this with i.i.d. (independent and identically distributed) Gaussian noise (mean = 0, SD = 1) to simulate the BOLD response to a stimulus at each voxel. The simulated activation patterns were extended with 28 irrelevant units (zero baseline activity and the same noise profile), which serve two important functions. Conceptually, they represent neural populations that are independent from the representations of interest and not involved with the task. Statistically, they allow us to evaluate the false alarm rate of our methods and construct tests of reliable unit selection.

### Statistical analysis

Separate analyses were conducted for the localized and dispersed layouts. All methods used the same uncorrected but conservative criterion for significance (α = 0.002).

Univariate contrast spatially smooths data, then identifies voxels whose mean activation across individuals at a given spatial location reliably differs for A versus B items (Friston et al., 1994). Smoothing used a boxcar average over a three unit window. Analysis then involved a two-tailed independent-samples *t* test (A vs B items) at each unit.

Searchlight MVPC seeks distributed representations by generating information maps across the cortex (Kriegeskorte et al., 2006; Pereira et al., 2009). At each voxel in each subject, a classifier is trained to discriminate items from different categories, based on the neural response evoked in neighboring voxels within a “searchlight” of fixed radius. Holdout accuracy is stored at the center voxel, and univariate tests across subjects then indicate which searchlights perform reliably above chance. Thus, information must be expressed within the searchlight radius and localized similarly across individuals. The model analysis leveraged the SearchMight toolbox (Pereira and Botvinick, 2011) for MATLAB and involved centering a searchlight of fixed radius on each unit. A support vector machine classifier was fit to unsmoothed unit activity within each searchlight with sixfold cross-validation. Unit groups were padded with noise units so that all searchlights contained the same number of units and a searchlight never encompassed units in different “regions.” The mean cross-validation accuracy was stored at each searchlight center, and the mean accuracy over model subjects at each unit was compared against chance using a two-tailed *t* test. We assessed the performance of searchlights ranging in radius from 3 to 14 units and report the best performing searchlight (size 7) together with the smallest and largest.

### Whole-brain MVPC with regularized regression

This approach fits and evaluates a single classifier per subject using all voxels and avoids overfitting by finding classifier coefficients β that jointly minimize prediction error and a regularization penalty (Lemm et al., 2011), as follows:
(1) where *f*(β) is the classifier prediction error, *h*(β) is the regularization penalty, and λ ∈ [0,1] is a free hyperparameter that scales the error versus regularization costs. The approach does not consider voxel location at all and so in principle can find nonlocal information. We used logistic loss for the error and considered two common regularization penalties: ridge regression, which increases with , and LASSO, which increases with .

Analyses were conducted using the Whole-Brain Imaging with Sparse Correlations (WISC) workflow (https://github.com/crcox/WISC_MVPA/releases/tag/FirstMajor). The same procedure was used for all regularized regression analyses, so we will describe it in detail here.

Each analysis involves two rounds with different aims. In the performance round, the aim is to assess how accurately a decoding model, fit to a subset of data, can then classify held-out test items. In this evaluation, it is important that the data used to fit model parameters and hyperparameters are independent of those used to evaluate the fitted model. To this end, we adopted the nested cross-validation procedure shown in Figure 2. Data were divided into *n* partitions. One partition was held out as a final test set. The remaining partitions were used as a training set to find good model parameters in an inner cross-validation loop. In the inner loop, a series of models was fit to 80% of the training data using different values for λ, with each evaluated on a held-out 20% of the training data, and iterating over different holdout sets. This loop searched many hyperparameter values, seeking the one that yields the best mean accuracy across the inner-loop holdouts. A final model is then fit to all training data at the selected λ, and this model is evaluated for accuracy on the final completely held-out test set. This procedure is conducted for each of the *n* partitions, yielding *n* estimates of decoding accuracy on completely held-out items; the expected decoding accuracy is then taken as the mean of those *n* estimates.

While the performance round indicates the expected accuracy of a decoding model, it does not provide a clear picture of which features the model exploits, since the procedure fits *n* different decoding models (one for each partition), each specifying its own set of coefficients. We therefore conducted an importance mapping round with the aim of assessing which features (units or voxels) a decoding model selects as “important” for classification. In this round, we select the hyperparameter that yielded the highest test accuracy in the performance round, then fit a decoding model to all data using the selected hyperparameter. This yields a single decoding model for each subject, which specifies one coefficient for each feature. This decoder was not evaluated for classification accuracy (since it was fit to all data) but was analyzed to assess which features reliably attract large or nonzero coefficients for a decoding model fit at the optimal hyperparameter.

For LASSO, it is straightforward to determine which features are selected: the regularizer pressures as many coefficients to 0 as possible, so any feature with a nonzero coefficient has been selected. To determine which features are selected in true data more often than expected from random data, we conducted a permutation analysis: the full procedure was run for each model subject on 1000 permutations, identical in all respects to the analysis above but with the category labels shuffled randomly each time. For each permutation, on every unit, we count how often the unit was selected across the 10 model subjects, providing a null distribution for probability of overlap across model subjects. We then count a unit as being reliably selected if the overlap across model subjects observed in the true data is higher than that observed in the permutation distribution with *p* < 0.002.

For ridge regression, all features always receive nonzero coefficients, so it can be unclear how to determine which are “selected” by the classifier. We adopted the common approach of treating units with large coefficients—specifically, those with an absolute value in the top quartile for each model—as having been selected by the classifier. To assess which units reliably attracted large coefficients across model runs, we tabulated, for each unit, the number of times its coefficient was in the top quartile by magnitude across model subjects, then computed the binomial probability of achieving this number or greater from 10 model runs given the base probability of landing in the top quartile by chance (0.25). We counted a unit as being reliably selected when this probability was <0.002.

The performance-round and importance-mapping round were conducted for L1 and L2 approaches using six data partitions.

### Results

Simulation results are shown in Figures 3 and 4. We found the following. Univariate contrast (UC) uncovered the independent code and code direction in IIO units but missed the distributed code over IH units in both layouts. Searchlight MVPC, at the best performing radius, discovered half of the independent code in IIO units and almost all of the distributed code in the localized layout, but missed the distributed code in the dispersed layout and always obscured the code direction for individual units (since information maps only reveal classifier accuracy). Smaller searchlight sizes performed similarly, while larger sizes obscured the distributed code even in the localized layout. Whole-brain MVPC with regularized regression showed qualitatively different results depending on the regularization penalty (Fig. 4*B*). With the L2 norm (ridge regression), the classifier showed above-chance holdout accuracy (mean = 59.2%, *t*_{(9)} = 5.36, *p* < 0.001) but placed nonzero coefficients on all units, making it difficult to discriminate informative from arbitrary and irrelevant units. The strategy of selecting units with large coefficients did not identify signal-carrying units, which were no more likely to attract large weights in the classifier than expected by chance.

Regularizing with the L1 norm produced a classifier with equally good holdout accuracy (mean = 57.2%, *t*_{(9)} = 2.34, *p* < 0.05; Ridge vs LASSO paired, *t*_{(9)} = 0.626, n.s.) and a much sparser solution. Only three units (two IH and one IIO) reliably received nonzero coefficients, with no false alarms. Thus, conventional regularization either missed substantial signal (L1) or selected everything (L2).

## Study 2: whole-brain MVPC with structured sparsity

We have suggested that structured sparsity (Jacob et al., 2009; Jenatton et al., 2011) can provide an avenue for preserving the strengths of established methods while avoiding their weaknesses. On this approach, a single classifier is fit to data from all subjects while a regularization penalty encourages desired sparsity patterns among the coefficients. In functional imaging, the solution should (1) clearly delineate selected and unselected voxels, (2) allow heterogeneous codes among neighboring units within and across individuals, (3) reveal code direction where this is consistent, (4) identify distal units that jointly express representational structure, (5) capitalize on shared location across subjects, where this exists, but also (6) tolerate individual variation in signal location.

In prior work, we defined the sparse-overlapping-sets (SOS) LASSO to meet these criteria (Rao et al., 2013, 2016). Voxels from all subjects are projected into a common reference space without interpolation or averaging. Sets are defined for grid points in the space, each encompassing all voxels within and across subjects that fall within a specified radius. Sets are analogous to searchlights in many respects; each includes all voxels within a spatially contiguous radius of a center point. As with searchlights, each voxel belongs to several sets, and sets overlap in the voxels they contain. The central difference lies in how this information about spatial grouping is used in model fitting. Rather than fitting a different decoder to each searchlight/set independently, all sets instead contribute simultaneously to the regularization cost *h* in Equation 1 as follows:
(2) where *S* defines the grouping of voxels into sets, *i* indexes model coefficients within a set, and γ ∈ [0,1] is a free parameter. The total cost is a sum over sets. The cost for each set is the proportional weighted sum of the LASSO sparsity penalty and a grouping penalty formulated as the root of the sum of squared coefficients. Because this root is taken over units within a set, the penalty is smaller when nonzero coefficients occupy the same set than when they occupy different sets (Rao et al., 2013). Thus, SOS LASSO (https://zenodo.org/record/3609239) encourages sparse solutions where selected voxels tend to occupy the same sets. The free hyperparameter γ controls the relative weighting of the sparsity versus grouping penalties within set. When γ = 0, the penalty reduces to LASSO. The optimization is convex and returns a unique solution for a given pair of hyperparameters (γ and λ). These are tuned via nested cross-validation on holdout error just as described for L1 and L2 regularization (Fig. 2).

SOS LASSO captures the spirit of searchlight analysis in seeking a whole-brain information map that identifies local regions where many signal-carrying voxels reside. The grouping penalty in the SOS LASSO cost function pressures the optimization to place nonzero coefficients in regions where information is localized similarly across participants. Yet, because SOS LASSO fits a single model to all data simultaneously, it can “see” all searchlights at once, and so can capitalize on information that may be distributed across multiple anatomically distal areas but not discernable within any single region (either because local correlations are weak or because information across regions must be combined for accurate decoding). Moreover, because the hyperparameter on the grouping penalty is tuned by data, the approach can down-weight or even ignore set membership if doing so leads to better prediction in the decoder. For instance, if signal-carrying voxels were not anatomically grouped within or across subjects, the optimization would select a value near 1 for the sparsity parameter (and near 0 for the grouping parameter), leading to a sparse solution that can be completely different in each participant.

SOS LASSO also has one property in common with L1 and L2 regularization that contrasts with searchlight MVPC: it returns a single whole-brain classifier for each individual subject. The holdout accuracy of this single classifier can be evaluated against chance to assess whether reliable signal has been found in each subject individually, without punishing correction for multiple comparisons and without assumption about homogeneity of signal location across subjects. While it is technically possible to conduct single-subject analyses using the searchlight procedure, this approach requires thousands of statistical tests in each participant to evaluate the classification accuracy of each searchlight. Thus, accuracy must be exceedingly high to survive correction for multiple comparisons. Consequently, the most common searchlight application involves testing the mean classification accuracy across subjects against chance at each voxel/location, which in turn relies on the assumption that important signal is localized similarly across individuals.

In these ways, SOS LASSO captures strengths while avoiding the limitations of other multivariate approaches. We also note that the approach is related to, but distinct from, other sparsity-based approaches that have been applied to neural decoding. The elastic net regularizes a whole-brain classifier with the weighted sum of the L1 and L2 norms (calculated across all model coefficients; Zou and Hastie, 2005). This is closely related to the regularization cost per set in SOS LASSO, but the set-wise evaluation of the cost has an important consequence: since the full cost is a sum over sets, the optimization seeks to have as many “empty sets” (all coefficients zero) as possible. Thus, SOS LASSO enforces a hard distinction between selected and unselected voxels (those with/without a nonzero coefficient). In contrast, elastic net, like ridge regression, often places small nonzero values on all features, and there exists no principled way of deciding which are important and which not. Our approach is also related to, and in fact generalizes, the overlapping group lasso, a common approach to multitask learning (Jacob et al., 2009) in which all features (e.g., voxels) in a selected group are constrained to have the same coefficient, and sparsity patterns are predefined. Our formulation allows different coefficients within set (so that solutions can vary across individual participants) and does not require a prespecified sparsity pattern. These relationships, together with mathematical analysis of the regularizer, are explained further in the studies by Rao et al. (2013, 2016).

### Computational efficiency

It is worth noting that SOS LASSO demands considerably more computational resources than simple regularization with the L1 or L2 norm, for two reasons. First, it requires search over two independent hyperparameters, squaring the number of iterations that must be conducted during the inner loop of the nested cross-validation. Second, because the approach conducts a single optimization on data from all subjects simultaneously, then the full dataset must be transferred to each worker node for each model fit—that is, the approach cannot conduct separate analyses on each participant dataset in parallel. This exerts significant memory and data transfer demands. Yet, much of the workflow we have described is embarrassingly parallel—for instance, all steps of the inner loop cross-validation, including model fitting for each combination of hyperparameters on every fold, can be conducted in parallel on a high-throughput network such as Open Science Grid. Thus, while the analyses we report would require years of computing time on a single workstation, they typically complete overnight on the large HTCondor system deployed at the University of Wisconsin-Madison.

### Statistical analyses

Decoding model data with the SOS LASSO was implemented using the same WISC workflow, cross-validation procedures, and permutation testing described for regularized regression in Study 1. To fit the SOS LASSO decoding models, each unit was assigned to one or more sets, with each set containing 14 consecutive units (analogous to searchlight radius 7 in this one-dimensional dataset), and with seven units overlapping between adjacent set pairs. The grouping and sparsity hyperparameters (γ and λ) were tuned using nested cross-validation with 10 data partitions. Decoding accuracy was evaluated in a performance round as described earlier, while feature selection was evaluated in an importance mapping round, also as described for Study 1. As with LASSO, the importance-mapping procedure returned a single decoder for each model “subject,” with many coefficients pressured to zero. Thus, we again treated any unit in any model receiving a nonzero coefficient as having been selected by the method and used permutation testing to assess which units are selected in true data more often than expected from random data.

### Results

Applied to model data, SOS LASSO achieved the best holdout accuracy (68.3%), significantly better than the next best method (LASSO, paired *t*_{(9)} = 4.869, *p* < 0.001) while uncovering much of the independent code and almost all of the distributed code in both layouts (Fig. 4*C*). The sign of the classifier coefficients revealed the consistent code direction in IIO units and code heterogeneity in IH units. SOS LASSO alone discovered the anatomically dispersed distributed signal, while preserving the strengths of the other methods.

## Study 3: finding distributed representation of faces in neural data

The simulations suggest that functional brain imaging studies may have missed important distributed signal even in well studied domains where multiple contemporary approaches have been applied. Study 3 assessed this possibility for visual face perception. We applied univariate contrast, searchlight MVPC, LASSO, and SOS LASSO to fMRI data collected in an unrelated study (Lewis-Peacock and Postle, 2008) where 10 participants evaluated 90 images depicting people, scenes, or objects (30 each). The study used a slow event-related design estimating the peak BOLD response to each image at each voxel without deconvolution. We applied each method to these data to find voxels whose activations discriminate face (people) from nonface (place and object) stimuli.

### Methods

The fMRI data (Lewis-Peacock et al., 2018) were collected by a different group in an independent study that reports the full methodology and image acquisition details (Lewis-Peacock and Postle, 2008). In a slow event-related design, subjects viewed 30 images from each of three categories (celebrities, famous locations, objects) in permuted order while their brains were scanned with fMRI. Images were acquired with a gradient echo, echoplanar sequence with 2000 ms repetition time (TR) and 50 ms echo time to acquire data sensitive to the BOLD signal within a 64 × 64 matrix (30 axial slices coplanar with the T1 acquisition, 3.75 × 3.75 × 4 mm). With button press on a 4 point Likert scale, they indicated their liking for the celebrity/location or familiarity with the object. Each trial consisted of a cue (2 s), stimulus (5 s), and judgment period (3 s) followed by an arithmetic task (16 s) to reduce interference between trials. Each stimulus appeared once. Functional data for each subject were masked to exclude noncortical voxels. The response to each stimulus at every voxel was taken as the BOLD signal recorded at the fifth TR following stimulus onset, without time-series deconvolution.

### Univariate analysis

Functional images from the fifth TR following stimulus onset were projected to Talairach space using the T1 data with a combination of manual landmark identification (anterior and posterior commissures) and automated affine transformation obtained using 3dvolreg in AFNI (Analysis of Functional NeuroImages). The response to each stimulus was smoothed with 4 mm FWHM Gaussian kernel and downsampled to the original resolution. At each voxel, we computed the mean response for face and nonface stimuli for each subject. In a whole-brain analysis, we tested for a group-level difference between these means at each voxel with a two-tailed dependent-samples *t* test (clusterwise α = 0.05). In a region of interest (ROI) analysis, we computed, for each subject, the difference in BOLD response to faces versus nonfaces averaged across voxels in the “right fusiform face area” (rFFA) mask from the study by Julian et al. (2012), then conducted a one-tailed *t* test (faces > nonfaces) against zero across subjects.

### Searchlight MVPC

We used SearchMight (Pereira and Botvinick, 2011) with a linear support vector machine classifier to generate native-space information maps for each subject using both a standard 9 mm and a larger 12 mm searchlight. Holdout accuracy for each searchlight was measured as the average of the hit and correct rejection rates, a metric with range [0,1] and chance performance of 0.5, corresponding to 50% accurate categorization, regardless of the number of items in each category. Throughout, we refer to accuracy in percentage rather than proportional terms. Data for all subjects were projected to Talairach space, spatially smoothed, and downsampled to native resolution as in the univariate analysis. A one-tailed *t* test against 0 (true positives > false positives) was conducted on the mean of this metric across subjects (clusterwise α = 0.05).

For whole-brain MVPC with L1 regularization, we evaluated holdout accuracy separately for each subject in a performance round using nested 10-fold cross-validation exactly as described for the simulation (Fig. 2). For each subject, the model was fit using all cortical voxels without any ROI or voxel preselection. This analysis allowed us to assess how accurately a classifier fit with L1 regularization could discriminate faces from nonfaces in each subject considered independently.

We then assessed whether decoding models across participants place nonzero coefficients in common regions using the same procedure used in the simulation. For each subject, we fit a single decoding model to all data, using the best performing hyperparameter discovered in initial assessment of decoding accuracy. This final model returned a single coefficient at each cortical voxel for every subject (with, of course, many coefficients set to 0). The classifier coefficients for each subject were projected to Talairach space with linear interpolation, smoothed with a 4 mm FWHM Gaussian kernel, and downsampled to the original resolution. We then counted how often each common-space voxel was selected across the 10 subjects. To statistically threshold this group map, we conducted a permutation test in which the identical procedure was followed, but with stimulus labels randomly permuted. From 1000 permutations, we estimated the base probability of selection under the null hypothesis for each voxel, then used this probability and the binomial distribution to threshold the group map of the true data at *p* < 0.002 uncorrected. For instance, if the probability of selection from permutations was 0.15, the voxel would exceed threshold if, in the true data, it was selected in ≥6 of the 10 participants (binomial probability, ∼0.0014).

SOS LASSO model fitting followed a similar procedure. To estimate decoding accuracy, we fit decoding models to all subject data simultaneously using the SOS LASSO optimization. Voxels were spatially aligned within Talairach coordinates without blurring or linear interpolation (i.e., we simply adjusted the spatial coordinates of each voxel based on the subject-to-common-space affine transform). Voxels within and across subjects were then assigned to multiple overlapping sets by tiling Talairach space with 18-mm-diameter cubes, overlapping by 9 mm along each axis. This diameter was chosen to be comparable to the 9 mm radius commonly used in searchlight. We note, however, that SOS LASSO is less sensitive to the group size than searchlight because (1) groups overlap, (2) voxels can be selected from multiple groups simultaneously, and (3) solutions can be sparse within groups. We again applied nested 10-fold cross-validation to estimate decoding accuracy in each subject. To determine whether selected voxels accumulate in similar regions across subjects, we conducted an importance-mapping round as described earlier, yielding a single decoding coefficient for each voxel in every subject. To visualize the spatial/anatomic distribution of these coefficients across subjects, we flagged selected voxels with a binary mask (1 for voxels with a nonzero coefficient, 0 otherwise), then spatially smoothed the mask with a truncated 4 mm FWHM Gaussian kernel.

Because SOS LASSO fits all subjects simultaneously, selection of a voxel in one subject alters the probability of selection in a nearby region in other subjects. Thus, the binomial assumption of independence across subjects adopted to threshold the LASSO analysis is violated in the SOS LASSO case. We instead adopted a stricter criterion for significance based on permutation sampling. For each permutation round, the entire SOS LASSO pipeline was conducted using randomly shuffled category labels in the model fit. Selected voxels were subjected to a binary threshold and spatially smoothed in the identical way. For each voxel in the common space, we then counted how many subjects received a nonzero value in the smoothed map after applying the threshold. Across 1000 such analyses with randomly permuted labels and ∼10,000 common-space voxels, no voxel was ever selected in more than seven subjects. We therefore masked the SOS LASSO maps to show voxels selected in eight or more subjects in the true data. Since all permutations are independent, this establishes a lower significance bound of *p* < 0.001 uncorrected. All follow-up analyses with SOS LASSO used the same procedure.

### Results

Figure 5 shows canonical “face” and “place” systems from the study by Julian et al. (2012) together with results for the current data from univariate contrast and searchlight methods. The univariate approach revealed significantly less activation for faces around parahippocampus bilaterally (*p* < 0.05 cluster corrected), while the ROI analysis found greater activation for faces in the right rFFA (*p* < 0.05). Searchlight MVPC with a 9 mm radius identified areas near the FFA bilaterally (*p* < 0.05 cluster corrected) and in lateral occipitotemporal regions of both hemispheres, while a 12 mm radius identified similar areas with a broader anatomic spread (*p* < 0.05 cluster corrected). As noted for simulations, this approach was unable to reveal code direction.

Results for whole-brain decoding with sparse regularization are shown in the top panel of Figure 6. The decoding models fit with LASSO and SOS LASSO showed equally high holdout accuracy in the performance round (LASSO, 88.6%; SOS, 86.8%; *t*_{(9)} =1.23, SE = 0.015, n.s. for within-subjects contrast). Analysis of final model coefficients for LASSO yielded a group result similar to searchlight: voxels selected in the group map more often than expected from permutations resided near FFA bilaterally, in right lateral occipitotemporal cortex, and small area in left occipitotemporal cortex (*p* < 0.002 uncorrected). In contrast with searchlight, the classifier coefficients indicated a consistent topographic code, with faces predicted by reduced activity around parahippocampus and elevated activity in lateral ventrotemporal regions of both hemispheres and in right occipitotemporal regions. Thus, these three approaches suggest similar but nonidentical conclusions about face representation in cortex, each with precedent in prior work: the UC result suggests that faces selectively activate the right FFA (Kanwisher et al., 1997); searchlight identifies localized bilateral signal around the FFA (Rivolta et al., 2014); and LASSO reveals a nonface-to-face gradient bilaterally in these regions plus a right-lateralized occipital face region (Martin, 2007).

Results from SOS LASSO differed strikingly. Colored areas in Figure 6 (top, bottom row) show regions consistently selected across participants more often than occurred in permutation testing. The hue depicts the proportion of participants receiving a positive coefficient in the corresponding region; positive coefficients signify that elevated BOLD response is associated with higher probability that the stimulus is a face. In addition to the canonical face and place systems, the results implicate regions spanning anterior temporal, frontal, and parietal cortex. Face stimuli were predicted by increased activity throughout the canonical face system, bilaterally in temporal poles and superior medial frontal areas, and in right inferior prefrontal cortex; and by decreased activity in parahippocampus, the dorsal visual stream, and left premotor cortex. Heterogeneous codes appeared in left occipito-temporo-parietal and lateral prefrontal cortices, and bilaterally in posterior fusiform.

How is the stark difference between SOS LASSO and other solutions to be interpreted? One possibility is that the approach exploits real signal within the canonical occipitotemporal face and place systems, but also places nonzero coefficients on voxels outside this system that do not carry actual signal—perhaps because they suppress correlated noise within the signal-carrying system (Henriksson et al., 2015) or for some other spurious reason. If that were so, the distributed-seeming signal revealed by SOS LASSO would be highly misleading. We therefore conducted two follow-up analyses to assess whether SOS LASSO can detect real signal outside of the canonical face and place systems.

In the first, we divided the data into the following two sets: a within-system set, defined as the canonical face and place system ROIs dilated by 7 mm and then eroded by 5 mm with AFNI 3dmask_tool (effective dilation, ∼3 mm in all directions while filling holes after back-projecting to 3 × 3 × 3.75 mm resolution in each subject with AFNI 3dfractionize); and an out-of-system set containing all remaining voxels. We used SOS LASSO to fit separate decoding models for each data subset. If the whole-brain result succeeds solely by decoding information contained within the canonical face and place systems (and selects out-of-system voxels for spurious reasons), then the within-system decoder should show better holdout accuracy than the out-of-system decoder. Instead, the reverse result was obtained: out-of-system classifiers showed significantly higher holdout accuracy (87%) than within-system classifiers (83%; two-tailed within-subjects test: *t*_{(9)} = 2.4, SE = 0.015, *p* < 0.05).

To address the possibility that good classification accuracy was still driven solely by signal within a temporo-occipital network (Haxby et al., 2000), we next used SOS LASSO to fit a decoding model using voxels in the parietal and frontal lobes only. Holdout classification accuracy remained high (85.52%). The analogous analysis with LASSO achieved 82.21% accuracy, clearly well above chance but reliably worse than SOS LASSO (two-tailed within-subjects test: *t*_{(9)} = 2.4, SE = 0.014, *p*= 0.039). Note that LASSO shows this good performance among a group of voxels that whole-brain LASSO did not initially select—clearly indicating that the strong emphasis on sparsity in LASSO can lead it to miss signal-carrying voxels.

Finally, if the widely distributed signal that SOS LASSO identifies arises because of correlated noise suppression or for other artifactual reasons, then a similarly distributed pattern should obtain regardless of the kind of stimulus being decoded. To assess whether this is so, we fit SOS LASSO decoding models using all cortical voxels to discriminate place from nonplace stimuli, applying procedures identical to those described earlier. The resulting models showed high test-set classification accuracy (79.3%), but from a very different distribution of selected voxels. As shown in the bottom panel of Figure 6, voxels discriminating place from nonplace stimuli were anatomically localized in a manner consistent with the canonical view of place representation in the brain (Epstein and Kanwisher, 1998; Epstein et al., 2003). The analysis demonstrates that SOS LASSO can yield anatomically localized solutions, and hence that the widely distributed result for faces is not artifactual. Together, these analyses suggest that the SOS maps differ from those produced by other methods, not because the method selects uninformative voxels, but because it detects reliable signal missed by other approaches.

Finally, we considered to what extent the SOS LASSO behavior depends on a particular choice of hyperparameters. Specifically, we examined the mean decoding accuracy for each pair of hyperparameter values evaluating during the performance round of model fitting, as well as the number of voxels selected for the best performing models selected at each level of the grouping parameter γ. These data are shown in Figure 7.

The left panel of Figure 7 shows that, for each level of the grouping parameter (the different lines), the decoding accuracy first rises as λ increases, then declines to chance (as the weight on the regularizer takes precedence over model accuracy). Each such curve peaks at a similar value, indicating that there are many pairs of parameters [γ, λ] that produce comparable decoding accuracy. The right panel of Figure 7 shows, however, that most such pairs find solutions of comparable sparsity. The plot shows holdout accuracy for the best-performing models at each level of γ, plotted against the number of selected voxels per subject. All models that show decoding accuracy comparable to the best also select a comparable number of voxels (∼300).

## General discussion

Four contemporary approaches to functional image analysis have difficulties discovering distributed signal that are remedied by a new approach based on structured sparsity, the SOS LASSO. All yielded different results when applied to fMRI data collected while participants judged images of faces, places, or objects, but prior approaches supported the common view that face perception engages posterior temporal and occipital cortices. SOS LASSO uncovered a broader network encompassing anterior temporal, frontal, and parietal regions. The result does not solely reflect the selection of spurious voxels: the approach discriminated face from nonface stimuli with high accuracy even when fit only to voxels outside temporal and occipital cortices and yielded a more localized solution when discriminating place from nonplace stimuli. Nor does the result reflect idiosyncrasies of the stimuli or task since the same dataset yielded canonical results when analyzed using established methods. SOS LASSO was the only method capable of finding distributed signal in simulation, and the only one to reveal a radically distributed network for face perception in the brain.

Many aspects of the SOS LASSO result cohere with standard views of face/place perception and the broader literature. Positive coefficients picked out the canonical face system and social-cognitive areas including the temporal poles (Olson et al., 2007), right orbito-frontal cortex (Adolphs, 2002), and superior medial-frontal cortex (Amodio and Frith, 2006). Negative coefficients picked out areas that encode less socially critical information, including scenes (parahippocampal place area; Epstein and Kanwisher, 1998) and object-directed action (dorsal visual stream, left dorsal premotor area; Kalénine et al., 2010). Other regions received mixed coefficients, indicating that distributed patterns can represent stimulus category in a manner that varies within and across individuals. These results echo recent work suggesting that neural representations may be more broadly distributed than heretofore suspected, for face perception specifically (Hanson and Halchenko, 2008; Zhen et al., 2013; Nestor et al., 2016) and for conceptual structure more generally (Huth et al., 2016; Pereira et al., 2018).

The contrasting finding of distributed face versus more localized place representations also accords with recent literature. Whereas early observations of a highly localized face area in right posterior fusiform have given way to an elaborate network of face-relevant regions, the same is not true for place processing in the brain. A recent review from Epstein and Baker (2019) indicates that the place system involves just three anatomically proximal regions—medial ventrotemporal cortex, lateral occipital cortex, and medial occipital cortex—all of which are encompassed in the anatomically localized region identified by SOS LASSO when decoding places.

The contrasting results have two important implications for hypotheses about neural representation. First, even large (12 mm) searchlights missed the widely distributed signal, indicating that it does not reside within local cortical regions considered independently. Thus, anatomically distal regions can jointly encode multivariate representational structure. Second, SOS LASSO assigned consistently positive or negative coefficients in regions where UC yielded a null result. Whereas UC considers each voxel independently, classifier coefficients indicate how the activation of a voxel contributes to classification when combined with those of other voxels. The contrast indicates that voxels can jointly contribute to representational structure in a directionally and locationally consistent manner even when they do not appear to independently correlate with that structure. In both cases, the joint contribution of multiple voxels or regions to representational structure may arise simply through the linear combination of correlations that are individually too weak to be detected by univariate methods, or because the contribution of one voxel or region to structure can be “masked” by the effects of others (e.g., in partial correlation). As a linear model that does not consider interactions, SOS LASSO cannot adjudicate these possibilities yet; but in either case it is clear that the joint consideration of multiple regions in parallel can uncover a broader structure than the analysis of each region independently.

In addition to core posterior temporal and occipital areas, maps of the extended face network often include anterior temporal cortex, inferior frontal cortex, and amygdala (Haxby et al., 2000; Ishai, 2008; Kanwisher and Barton, 2011; Avidan et al., 2014; Duchaine and Yovel, 2015). Our results with SOS LASSO suggest an even broader network in which face perception generates elevated activity in social-cognitive networks overlapping with the extended face network, reduced activity in networks relevant to navigation and object-directed action, and heterogeneous patterns in parietal and frontal cortices. Elements of the full pattern vary in the direction, independence, and localization of their code, so previous methods each provide only a partial view. Their agreement with canon arises because core and extended face systems comprise parts of the pattern discoverable via multiple methods—regions where information is encoded efficiently and independently within circumscribed cortical regions, often in a directionally consistent manner localized similarly across subjects.

### Relationship to prior work

The relationship between representations acquired by neural network models and patterns of activation measured in the brain is currently an active area of research, especially in visual neuroscience (Kriegeskorte and Kievit, 2013; Cox and Dean, 2014; Khaligh-Razavi and Kriegeskorte, 2014; Clarke et al., 2015; Kriegeskorte, 2015; Cichy et al., 2016; Kheradpisheh et al., 2016; Marblestone et al., 2016). The most common approach compares the similarity structure arising in different layers of a neurocomputational model to those arising in different regions of the ventral visual processing stream, using correlation or other unsupervised methods. Such efforts have revealed important and remarkable similarities between the representations observed in real brains and those acquired by, for instance, deep convolutional image classifiers (Kriegeskorte, 2015). Like our work, these contributions illustrate how artificial neural network models provide a useful tool for bridging neural and computational levels of explanation for cognitive phenomena. Focusing on individual network layers and localized cortical regions, however, neglects the critical possibility suggested by network models that representation and processing can be distributed across anatomically distal regions. The current work illustrates not only how such a signal can be recovered in principle, but that radically distributed representations of this kind arise in human cortex.

The work also elaborates the evolving distributed view of visual object representations in the brain. Early perspectives emphasized a highly modular and localized view in which different brain regions were dedicated via evolutionary pressures to representing distinct categories of objects (Kanwisher et al., 1997; Caramazza and Shelton, 1998; Epstein and Kanwisher, 1998; Kanwisher, 2000; Downing et al., 2001). This began to change with the pioneering application by Haxby et al. (2000) of unsupervised multivariate methods to analysis of evoked patterns of activation in the ventral processing stream, which revealed a more distributed code localized within posterior ventral temporal cortex. As new multivariate methods have developed, many studies have argued for distributed visual object representations across occipital and posterior temporal regions (Haxby et al., 2001; Kriegeskorte et al., 2008b; Connolly et al., 2012), with some extending the face system up to anterior temporal cortex (Kriegeskorte et al., 2007; Nestor et al., 2008; Tsao et al., 2008; Rajimehr et al., 2009; Nestor et al., 2011; Collins and Olson, 2014). Our work continues the general trend to suggest that face representations are distributed even more broadly, across temporal, parietal, and frontal lobes in both hemispheres.

Zhen et al. (2013) reported a distributed network for face processing that spans all four lobes of the brain, making it, to the best of our knowledge, the one prior study that reports a pattern of results that resembles our own. Notably, they conducted a group constrained subject specific (GSS) univariate analysis (Fedorenko et al., 2010). While this study provides an important external validation of our results, it is worth noting a key methodological difference. Because GSS is univariate, it cannot detect information jointly encoded across voxels or regions; and because tests of significance are based solely on cross-group probabilities, it relies on common localization of signal across individual participants. As we have emphasized, SOS LASSO can exploit joint information across voxels, and returns an information map for each participant. Thus, while it is possible to inspect cross-subject consistency in location in a group map as we have done, the method does not *require* signal to be localized similarly across subjects. It therefore becomes an empirical question whether the signal-carrying voxels across subjects reside in similar anatomic locations. Perhaps because of these differences, we found the radically distributed face code with many fewer subjects (they included 42, 10 of whom were scanned seven times each) much more consistently (in 80% of participants vs a maximum of 25%), and in key regions not observed by Zhen et al. (2013) but known to be involved in face processing, such as the left temporal pole (Gorno-Tempini et al., 1998; Griffith et al., 2006).

We have focused on discriminating one experimental condition from another with multivariate classifiers. Alternatively, representational similarity analysis (RSA) seeks voxel sets that jointly express a target similarity structure (Kriegeskorte et al., 2008a), while “generative” approaches predict whole-brain images from externally derived stimulus features or experimental condition (Mitchell et al., 2008), each typically implemented in ways that limit their ability to detect network-distributed signal. Structured sparsity may likewise provide new insights for these kinds of problems (Oswal et al., 2016).

Other groups are also exploring the utility of structured sparsity for identifying distributed neuro-cognitive representations in neuroimaging data (Carroll et al., 2009; Baldassarre et al., 2012; Jenatton et al., 2012; Rish et al., 2012; Manning et al., 2014; Cohen et al., 2017). Understanding the relations among these approaches and SOS LASSO is a central goal for future research.

### Data availability

The whole-brain imaging with Sparse Correlations (WISC) MVPA tools and code for specifying simulations can be found at https://github.com/crcox/WISC_MVPA and https://github.com/crcox/SOSLassoSimulations, respectively.

## Footnotes

This work was partially supported by Medical Research Council Program Grant MR/J004146/1 and by European Research Council Grant GAP: 670428 - BRAIN2MIND_NEUROCOMP. This research was performed using the computer resources and assistance of the University of Wisconsin (UW)-Madison Center for High Throughput Computing, which is supported by UW-Madison, the Advanced Computing Initiative, the Wisconsin Alumni Research Foundation, the Wisconsin Institutes for Discovery, and the National Science Foundation. We thank Nikhil Rao and Robert Nowak for developing the SOS Lasso loss function and its MATLAB implementation; Mark Seidenberg and Matthew A. Lambon Ralph for valuable discussions of this work in its development; and Brad Postle and Jarrod Lewis-Peacock for sharing functional imaging data with us.

The authors declare no competing financial interests.

- Correspondence should be addressed to Christopher R. Cox at chriscox{at}lsu.edu