Printed from Oxford Research Encyclopedias, Neuroscience. Under the terms of the licence agreement, an individual user may print out a single article for personal use (for details see Privacy Policy and Legal Notice).

date: 03 December 2022

# Predictive Coding Theories of Cortical Function

• Rajesh P.N. Rao, University of Washington
• Linxing Preston Jiang, University of Washington

### Summary

Predictive coding is a unifying framework for understanding perception, action, and neocortical organization. In predictive coding, different areas of the neocortex implement a hierarchical generative model of the world that is learned from sensory inputs. Cortical circuits are hypothesized to perform Bayesian inference based on this generative model. Specifically, the Rao–Ballard hierarchical predictive coding model assumes that the top-down feedback connections from higher to lower order cortical areas convey predictions of lower-level activities. The bottom-up, feedforward connections in turn convey the errors between top-down predictions and actual activities. These errors are used to correct current estimates of the state of the world and generate new predictions. Through the objective of minimizing prediction errors, predictive coding provides a functional explanation for a wide range of neural responses and many aspects of brain organization.

### Subjects

• Computational Neuroscience

### Introduction

A normative theory for understanding perception is that the brain uses an internal model of the external world to infer the hidden causes of its sensory inputs and maintain beliefs about these causes. In the early work of Gregory and colleagues (1980), perception was defined as hypothesis testing, emphasizing the process of inferring explanations for sensory inputs. The notion that perception is an inference process based on internal models (rather than a purely bottom-up feature-extracting process) is well exemplified by the phenomenon of binocular rivalry (Tong et al., 2006). Binocular rivalry occurs when conflicting monocular images are presented separately to each of the two eyes (figure 1A). Instead of perceiving a stable mixture or superposition of the two stimuli, the subject perceives exclusively the object or feature in one of the two distinct images presented to each eye, with perception alternating between the two images every few seconds. Such “rivalry” challenges the traditional stimulus-driven feature-extraction view of perception—why would perception alternate between two interpretations if the process is completely bottom-up, given that the stimulus does not change? When perception is viewed as forming hypotheses to infer the hidden causes of images, binocular rivalry can be understood as the brain entertaining two competing hypotheses to explain a conflicting sensory input.

Having an internal model of the environment also helps disambiguate sensory inputs with multiple interpretations. Figure 1B shows an example: The two footprints on the right appear to be convex (oriented upward toward the viewer) while the two on the left appear to be concave (oriented downward away from the viewer). However, the image on the right is the same as the image on the left, only rotated 180 degrees. The two different interpretations of the footprints arise from the brain using a “light-from-above” prior assumption (Sun & Perona, 1998): the brain’s internal model assumes that light sources tend to be above the observer, an ecologically valid assumption. Such assumptions are necessary because visual perception is an ill-posed problem: Multiple 3D configurations can give rise to the same 2D image due to the projection of the 3D world onto a 2D retina, making assumptions such as “light-from-above” necessary for inferring properties of visual objects. Note that the observer is typically not aware of such prior assumptions but rather, they are incorporated by the neural circuits subconsciously to compute beliefs over hidden causes through the dynamics of neural activities (thereby implementing perception as “unconscious inference”) (Von Helmholtz, 1867). As part of the internal model, such priors can be expected to be adapted to the environment that the organism lives in.

How can neural circuits in the cortex learn internal models of the world, and how can such circuits combine prior beliefs with sensory evidence for Bayesian inference? Predictive coding offers a possible neural implementation. The predictive coding model of Rao and Ballard (1999) assumes that the areas comprising the cortical hierarchy (Felleman & Van Essen, 1991; Hubel & Wiesel, 1959) implement a hierarchical generative model of the sensory world. The neural activities at each level of the hierarchy represent the brain’s internal belief of the hidden causes of the stimuli at a particular abstraction level (e.g., edges, object parts, objects). Furthermore, the model assumes that the top-down feedback connections from higher to lower order cortical areas convey predictions of lower-level activities. The bottom-up feedforward connections in turn convey prediction errors, calculated as the difference between the top-down predictions and actual activities. The neural activities at each level representing the beliefs about the hidden causes are jointly influenced by both the top-down predictions and the bottom-up error signals. Overall, the model assumes that the goal of the cortex is to minimize prediction errors across all levels. Importantly, the above neural operations can be interpreted within a Bayesian framework: The top-down predictions convey prior beliefs based on learned expectations while the bottom-up prediction errors carry evidence from the current input. Predictive coding combines these two sources of information, weighted according to their reliability (inverse variances or “precisions”), to compute the posterior beliefs over hidden causes at each level. The objective of minimizing prediction errors across all levels can thus be shown to be equivalent to finding the maximum a posteriori (MAP) estimates of the hidden causes.
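The reliability-weighted combination of a prior with sensory evidence can be illustrated with the simplest possible case: a scalar hidden cause with a Gaussian prior and a single Gaussian observation. The sketch below is an assumption-laden illustration rather than part of the Rao-Ballard model itself; the function name `fuse_gaussian` and all numerical values are hypothetical. It writes the posterior mean in "prediction plus weighted prediction error" form, where the weight on the error is set by the relative precisions.

```python
def fuse_gaussian(prior_mean, prior_var, obs, obs_var):
    """Combine a Gaussian prior with a Gaussian observation of the same quantity."""
    prior_precision = 1.0 / prior_var
    obs_precision = 1.0 / obs_var
    post_var = 1.0 / (prior_precision + obs_precision)
    # Posterior mean = prior prediction + gain * prediction error.
    # The gain grows as the observation becomes more reliable than the prior.
    gain = obs_precision * post_var
    post_mean = prior_mean + gain * (obs - prior_mean)
    return post_mean, post_var

# Confident prior (variance 1) and a noisy observation (variance 4):
# the estimate stays close to the prior prediction.
mean_a, var_a = fuse_gaussian(prior_mean=0.0, prior_var=1.0, obs=10.0, obs_var=4.0)

# Same prior and a reliable observation (variance 0.25):
# the estimate moves most of the way toward the observation.
mean_b, var_b = fuse_gaussian(prior_mean=0.0, prior_var=1.0, obs=10.0, obs_var=0.25)

print(mean_a, var_a)
print(mean_b, var_b)
```

In both cases the posterior variance is smaller than either the prior or the observation variance, which is why combining the two sources is preferable to relying on one alone.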

The phrase “predictive coding” was originally used to capture a form of efficient coding. The center-surround receptive fields and biphasic temporal antagonism in responses of cells in the retina and lateral geniculate nucleus (LGN) can be interpreted as performing decorrelation through a simple form of predictive coding: rather than conveying the local intensity directly, retinal and LGN cells can be interpreted as sending the differences (errors) between the local intensity and a prediction of that intensity computed as a linear weighted sum of nearby values in space and preceding input values in time (Dong & Atick, 1995a; Huang & Rao, 2011; Srinivasan et al., 1982). In auditory information processing, Smith and Lewicki (2006) used the same efficient coding principle to derive a model which yields kernels (filter weights) that closely match auditory filters.
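The decorrelation idea can be made concrete with a one-dimensional toy signal. The sketch below is not a model of retinal circuitry: it assumes a synthetic, temporally correlated signal (a first-order autoregressive process with a hypothetical coefficient of 0.95) and a linear predictor that uses only the preceding sample. It shows that transmitting the prediction error instead of the raw signal reduces both the variance and the temporal correlation of what is sent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, strongly correlated "intensity" signal (assumed AR(1) process).
n_samples = 5000
signal = np.zeros(n_samples)
for t in range(1, n_samples):
    signal[t] = 0.95 * signal[t - 1] + rng.normal()

# Linear prediction of each sample from the preceding one,
# with the weight chosen by least squares.
past, present = signal[:-1], signal[1:]
weight = (past @ present) / (past @ past)
error = present - weight * past          # what a predictive coder would transmit

def lag1_correlation(x):
    """Correlation between a sequence and itself shifted by one step."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print("variance of signal / error:", signal.var(), error.var())
print("lag-1 correlation of signal / error:",
      lag1_correlation(signal), lag1_correlation(error))
```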

More broadly, predictive coding can be viewed as Bayesian inference in the context of Rao and Ballard’s hierarchical predictive coding model (Rao & Ballard, 1997, 1999). This model was originally proposed to explain extra-classical receptive field effects and contextual modulation. More recent models inspired by predictive coding have demonstrated that a network trained to predict future inputs can explain a number of other cortical properties (Lotter et al., 2020; Singer et al., 2018). Beyond the cortex, the idea of computing errors between top-down predictions and lower-level inputs is consistent with theories of the cerebellum (Bell et al., 1997; Wolpert et al., 1998) and models of dopamine responses as reward prediction errors (Schultz et al., 1997). These examples suggest that the general principle of predictive coding could be a widely applicable and flexible algorithmic strategy implemented by the brain across different regions to support perception, motor control, and reward-based learning.

Empirical evidence for prediction and prediction error signals in the cortex has been growing at a fast pace. Neural responses corresponding to prediction errors induced by visual mismatches during self-generated locomotion have been discovered in layer 2/3 of the primary visual cortex (V1) in rodents (Fiser et al., 2016; Keller et al., 2012). Predictive signals have been found in V1 when an animal is adapted to visual-locomotion coupling in a virtual environment (Fiser et al., 2016). The cortex also learns to predict novel auditory stimuli coupled to an animal’s locomotion and once learned, suppresses the responses to the learned stimuli in primary auditory cortex (Schneider et al., 2018), consistent with prediction error minimization. More recent studies (Jordan & Keller, 2020) have found some support for the distinct computational roles of the laminar structure of cortical columns proposed by predictive coding theories. Recent research has also found that unexpected stimuli which induce large prediction error signals can drive synaptic learning in neural circuits (Gillon et al., 2021), as expected in a predictive coding circuit that uses prediction errors to learn a generative model of the world.

This review is organized as follows. The section “Predictive Coding Models: An Overview” introduces the Rao-Ballard predictive coding model (Rao & Ballard, 1999) and several related models, as well as the relationship to the free energy principle and active inference. The section “Predictive Coding in the Visual System” discusses the application of hierarchical predictive coding to the visual cortex, explaining classical and extra-classical receptive field effects in V1 in terms of prediction error minimization, followed by a review of experimental studies investigating predictive coding in the neocortex in the section “Empirical Evidence for Predictive Coding”. The final section discusses open questions pertaining to predictive coding and potential future directions.

### Predictive Coding Models: An Overview

The predictive coding model of Rao and Ballard begins with the assumption that sensory inputs are generated by hidden states or “causes” in the external world via an unknown generative model. The goal of the brain is then to learn this generative model over many inputs. Perception, for any given sensory input $I$, involves inverting this generative model, that is, estimating the hidden states or causes of input $I$ given a learned generative model. Neural activities in the predictive coding model are assumed to represent the network's estimate of the hidden state (also known as the latent variable) vector $r$, given the observed sensory input vector $I$. The prior distribution of hidden states is assumed to be $p(r)$, which imposes a constraint on neural activities such as sparse activation. The observation model $p(I|r)$ is the likelihood that input $I$ is generated given the cause or hidden state $r$. The predictive coding model assumes that $p(I|r)$ is parameterized by a matrix $U$, which is assumed to be learned and encoded in the “top-down” synaptic weights of the network. Inference and learning correspond respectively to estimating $r$ (equivalent to perception) and learning an estimate of $U$ (corresponding to synaptic learning), both with the goal of maximizing the joint probability $p(I, r)$. Since $p(I)$ is constant for a given input, this is equivalent to maximizing the posterior probability $p(r|I)$, also known as maximum a posteriori (MAP) inference.

#### Generative Model of Images

In the predictive coding model, the likelihood $p(I|r)$ is governed by the following equation, which relates the hidden state $r$ to the input $I$ via a function $f$ and a matrix $U$:

$I = f(Ur) + n$ (1)

Here, $n$ is assumed to be zero mean Gaussian noise with covariance $\sigma^2 \mathbb{1}$ ($\mathbb{1}$ is the identity matrix). This equation states that the input is assumed to be generated as a linear combination of the columns of matrix $U$ weighted by the elements of $r$, followed by a function $f$ and additive noise. The function $f$ is a linear or nonlinear function (e.g., identity function, rectification function, or a sigmoidal function). The columns of $U$ can be regarded as the “basis” vectors (e.g., edges or “parts” of an image or scene) that can be used to compose an input according to the values in the hidden “causes” vector $r$. Given Equation 1 and the fact that $n$ is zero mean Gaussian, the negative logarithm of the likelihood $p(I|r)$ can be shown to be proportional to:

$H_1 = \frac{1}{\sigma^2} \|I - f(Ur)\|_2^2$ (2)

where $\|x\|_2^2 = \sum_i x_i^2$ denotes the squared Euclidean (or $L_2$) norm of vector $x$. $H_1$ is the sum of squared errors between the image $I$ and its reconstruction (or “prediction”) $f(Ur)$ across all pixels, weighted by the inverse noise variance (or precision) $\frac{1}{\sigma^2}$. The predictive coding model also allows prior probability distributions $p(r)$ and $p(U)$ for the parameters $r$ and $U$, respectively. Taking these priors into account, we obtain the overall optimization function:

$H = H_1 + H_2$ (2.1)

with

$H_2 = g(r) + h(U)$

where $g(r)$ and $h(U)$ are proportional to the negative logarithms of $p(r)$ and $p(U)$, respectively. If one assumes that both prior distributions are zero mean Gaussians with inverse variances $\alpha$ and $\lambda$, respectively, one obtains:

$g(r) = \alpha \sum_i r_i^2, \quad h(U) = \lambda \sum_{i,j} U_{i,j}^2$

Minimizing the overall optimization function $H$ is thus equivalent to MAP estimation. Predictive coding minimizes this objective function using both inference (of $r$) and learning (of $U$). Inference of $r$ is implemented by a recurrent neural network that performs gradient descent on $H$ with respect to $r$ for each input. Remarkably, rather than being chosen a priori, the architecture of the predictive coding neural network is predicted from first principles by the gradient descent equations for optimizing $H$ with respect to $r$ (see “Network Dynamics and Synaptic Learning” section for details). The matrix $U$ is represented by the synaptic weights of the same network and learned through gradient descent on $H$ with respect to $U$ across many inputs.

##### Sparse Coding as a Special Case of Predictive Coding

The sparse coding model of Olshausen and Field (1996) for learning simple cell-like receptive fields can be regarded as a special case of the predictive coding model described above. In their model, the choice of the likelihood $p(I|r)$ remains the same as above, but the prior $p(r)$ for the hidden state (causes) $r$ is assumed to be a heavy-tailed distribution such as a Laplace distribution. Such a prior encourages sparsity in $r$ (the majority of the elements of $r$ are zero or close to zero). Their model does not explicitly assume any specific prior for the synaptic weights $U$. The inference and learning processes are almost identical to those for a single-level predictive coding model (see “Network Dynamics and Synaptic Learning” section). When applied to natural image patches, their model produces localized, orientation-selective receptive fields (columns of $U$) similar to those of V1 simple cells, whereas a Gaussian prior produces more global receptive fields. Such a sparseness prior promotes statistical independence in the output and encourages efficiency by selecting only a small subset of features to encode information (Barlow, 2012; Olshausen & Field, 1996, 1997). The underlying assumption here is that objects in the natural world are composed of a wide variety of features (or parts) but any given object is composed of only a small subset of them. This is consistent with the view that the brain evolved to adopt ecologically useful priors for learning its neural representations in its quest to learn an internal model of the world appropriate for the organism’s ecological niche.
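The effect of the prior on the inferred causes can be sketched numerically. The example below is an illustration under stated assumptions, not the procedure of Olshausen and Field: it uses a random (not learned) overcomplete basis, an identity $f$, and a swapped-in standard solver, the iterative shrinkage-thresholding algorithm (ISTA), to perform MAP inference under a Laplace prior. A Gaussian prior is handled in closed form for comparison. All function names and parameter values are hypothetical.

```python
import numpy as np

def soft_threshold(x, threshold):
    """Shrink each element toward zero; elements within the threshold become exactly 0."""
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)

def infer_sparse(I, U, lam=0.1, n_steps=500):
    """ISTA for min_r 0.5 * ||I - U r||^2 + lam * ||r||_1 (Laplace prior on r)."""
    step = 1.0 / np.linalg.norm(U, 2) ** 2      # 1 / largest eigenvalue of U^T U
    r = np.zeros(U.shape[1])
    for _ in range(n_steps):
        error = I - U @ r                        # prediction error
        r = soft_threshold(r + step * (U.T @ error), step * lam)
    return r

def infer_gaussian(I, U, alpha=0.1):
    """Closed-form minimizer of 0.5 * ||I - U r||^2 + 0.5 * alpha * ||r||^2."""
    return np.linalg.solve(U.T @ U + alpha * np.eye(U.shape[1]), U.T @ I)

rng = np.random.default_rng(0)
U = rng.normal(size=(8, 16))                     # overcomplete: 16 causes, 8 inputs
U /= np.linalg.norm(U, axis=0)                   # unit-norm basis vectors
r_true = np.zeros(16)
r_true[[2, 11]] = [1.5, -2.0]                    # only two causes are active
I = U @ r_true

r_sparse = infer_sparse(I, U)
r_dense = infer_gaussian(I, U)
print("nonzero causes (Laplace prior): ", np.count_nonzero(r_sparse))
print("nonzero causes (Gaussian prior):", np.count_nonzero(r_dense))
```

The Laplace prior drives many elements of the estimate to exactly zero, while the Gaussian prior spreads the explanation across all causes.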

##### Hierarchical Predictive Coding

The above-described generative model can be extended to multiple hierarchical levels by assuming that the hidden state $r$ can be generated by a higher-level representation $r^h$, corresponding to more abstract image properties than the lower-level representation:

$r = r^{td} + n^{td}$

where $r^{td} = f(U^h r^h)$ is the top-down prediction of $r$ and $n^{td}$ is zero mean Gaussian noise with variance $\sigma_{td}^2$. The lower-level neurons have smaller receptive fields and represent a local image region by estimating the hidden state $r$. The higher-level neurons estimate their state $r^h$ based on several lower-level hidden states $r$ associated with local image patches. This arrangement results in a progressive convergence of inputs from lower to higher levels and an increase in receptive field size as one ascends the hierarchical network (figure 2), until the receptive fields of the highest-level neurons span the entire input image.

The overall optimization function for the hierarchical predictive coding model is:

$H = \frac{1}{\sigma^2}(I - f(Ur))^\top (I - f(Ur)) + \frac{1}{\sigma_{td}^2}(r - r^{td})^\top (r - r^{td}) + g(r) + g(r^h) + h(U) + h(U^h),$

where $g(r^h)$ and $h(U^h)$ are terms proportional to the negative logarithm of the priors for $r^h$ and $U^h$, respectively. Minimizing $H$ is again equivalent to maximizing the posterior probability $p(r, r^h, U, U^h | I)$. Perceptual inference involves minimizing $H$ with respect to $r$ and $r^h$ jointly, and learning involves minimizing $H$ with respect to $U$ and $U^h$. Note that the first-level state $r$ is now conditioned on the second-level state $r^h$ and synaptic weights $U^h$, but an additional prior constraint such as sparseness may be placed on $r$ as well (the $g(r)$ term).

##### Network Dynamics and Synaptic Learning

Given the hierarchical generative model above, a MAP estimate of $r$ can be obtained using gradient descent on $H$ with respect to $r$:

$\frac{dr}{dt} = -\frac{k_1}{2} \frac{\partial H}{\partial r} = \frac{k_1}{\sigma^2} U^\top \frac{\partial f^\top}{\partial x} (I - f(Ur)) + \frac{k_1}{\sigma_{td}^2} (r^{td} - r) - \frac{k_1}{2} g'(r)$ (3)

where $k_1$ is a positive constant governing the rate of descent toward a minimum for $H$, $x = Ur$, and $g'$ is the derivative of $g$ with respect to $r$. A discrete time implementation of the above-mentioned dynamics leads to the following update equation for $r$ at each time step (represented by neural activities or firing rates):

$\hat{r}_t = \hat{r}_{t-1} + \frac{k_1}{\sigma^2} U^\top \frac{\partial f^\top}{\partial x} (I - f(U\hat{r}_{t-1})) + \frac{k_1}{\sigma_{td}^2} (r^{td} - \hat{r}_{t-1}) - \frac{k_1}{2} g'(\hat{r}_{t-1})$ (3.1)

This equation, derived from first principles, specifies recurrent network dynamics for hierarchical predictive coding in terms of how the firing rate (or neural response) vector $r$ at a given level should be updated over time. At each time step, the neural activity vector $r$ is multiplied by the feedback matrix $U$ and a new prediction is generated for the lower level (figures 2A and 2B). This prediction is then subtracted from the lower-level representation $I$ to generate the bottom-up error $I - f(Ur)$, which is filtered by the feedforward weights $U^\top$ and the gradient of the function $f$. Note that the bottom-up synaptic weights are the transpose of the top-down synaptic weights in this model, although this assumption can be relaxed using an approach similar to the one used in variational autoencoders (VAEs) (see “Predictive Coding and the Free Energy Principle” section). The neural response vector $r$ is updated based on a weighted combination of the bottom-up prediction error $I - f(Ur)$ and the top-down prediction error $r^{td} - r$ (figure 2B). Each error is weighted by the inverse of the corresponding noise variance: The larger the noise variance, the smaller the weight given to that error term, consistent with the concept of Kalman filtering (see section “Prediction in Time: Spatiotemporal Predictive Coding and Kalman Filtering”).
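The update in Equation 3.1 can be sketched for the simplest case. The code below is a minimal illustration, assuming a single level with no top-down input, an identity function $f$, and a Gaussian prior $g(r) = \alpha \|r\|^2$ (so that $g'(r) = 2\alpha r$). The basis matrix is random instead of learned, and the names `infer` and `objective` as well as all parameter values are hypothetical. Under these assumptions the objective is quadratic, so the settled estimate can be checked against a closed-form solution.

```python
import numpy as np

def objective(I, U, r, sigma2=1.0, alpha=0.1):
    """H = (1/sigma^2) ||I - U r||^2 + alpha ||r||^2 (f = identity, no prior on U)."""
    return np.sum((I - U @ r) ** 2) / sigma2 + alpha * np.sum(r ** 2)

def infer(I, U, sigma2=1.0, alpha=0.1, k1=0.2, n_steps=1000):
    """Iterate the discrete-time update for r, starting from r = 0."""
    r_hat = np.zeros(U.shape[1])
    history = [objective(I, U, r_hat, sigma2, alpha)]
    for _ in range(n_steps):
        prediction = U @ r_hat                 # top-down prediction of the input
        error = I - prediction                 # bottom-up prediction error
        r_hat = (r_hat
                 + (k1 / sigma2) * (U.T @ error)        # error filtered by U^T
                 - (k1 / 2.0) * (2.0 * alpha * r_hat))  # prior term g'(r)
        history.append(objective(I, U, r_hat, sigma2, alpha))
    return r_hat, history

rng = np.random.default_rng(0)
U = rng.normal(size=(16, 4)) / 4.0             # 16 "pixels", 4 hidden causes
I = U @ np.array([1.0, -0.5, 0.0, 2.0])

r_hat, history = infer(I, U)

# For this quadratic objective the minimum is available in closed form.
r_closed_form = np.linalg.solve(U.T @ U + 0.1 * np.eye(4), U.T @ I)
print(np.round(r_hat, 4))
print(np.round(r_closed_form, 4))
```

The rate `k1` must be small enough for the iteration to be stable; the value used here was chosen for this toy problem only.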

The learning rule for the feedback synaptic weights $U$ (and feedforward weights $U^\top$) is obtained by using gradient descent on $H$ with respect to $U$:

$\frac{dU}{dt} = -\frac{k_2}{2} \frac{\partial H}{\partial U} = \frac{k_2}{\sigma^2} \frac{\partial f^\top}{\partial x} (I - f(Ur)) r^\top - \frac{k_2}{2} h'(U)$ (4)

where $k_2$ is a positive parameter determining the learning rate of the network, $x = Ur$, and $h'$ is the derivative of $h$ with respect to $U$. Note that this learning rule is a form of Hebbian plasticity: for the feedforward weights $U^\top$, the input presynaptic activity is the residual error $I - f(Ur)$ (weighted by $\frac{\partial f^\top}{\partial x}$) and the output postsynaptic activity is $r$. More importantly, unlike backpropagation, the learning rule above is local since the feedforward connection explicitly conveys the prediction error at each level. To ensure stability, learning of synaptic weights operates on a slower time scale than the dynamics of $r$: The learning rate $k_2$ is a much smaller value than the rate $k_1$ governing the dynamics of the network. For static inputs, this implies that the network responses $r$ converge to an estimate for the current input before the synaptic weights $U$ are updated based on this converged estimate. An example two-level hierarchical network is depicted in figure 2C.
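The separation of time scales can be sketched as a loop in which $r$ is first settled for each input and the weights are then nudged by the outer product of the residual error and the settled response. The code below is a simplified illustration with several assumptions: a single level, identity $f$, Gaussian priors on $r$ and $U$, synthetic data, and a closed-form solve standing in for running the fast network dynamics to convergence. The names `map_estimate`, `learn_U`, and `mean_error` and all parameter values are hypothetical.

```python
import numpy as np

def map_estimate(I, U, sigma2=1.0, alpha=1.0):
    """Settled response r for f = identity and g(r) = alpha * ||r||^2.
    The closed form stands in for iterating the network dynamics to convergence."""
    A = U.T @ U / sigma2 + alpha * np.eye(U.shape[1])
    return np.linalg.solve(A, U.T @ I / sigma2)

def mean_error(images, U):
    """Average squared residual ||I - U r||^2 over a set of inputs."""
    return np.mean([np.sum((I - U @ map_estimate(I, U)) ** 2) for I in images])

def learn_U(images, n_causes, sigma2=1.0, lam=1e-3, k2=0.05, n_epochs=30, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(images.shape[1], n_causes))
    for _ in range(n_epochs):
        for I in images:
            r = map_estimate(I, U, sigma2)       # fast time scale: settle r
            error = I - U @ r                    # residual carried bottom-up
            # Slow time scale: Hebbian-like outer product of error and response,
            # plus weight decay from the Gaussian prior h(U) = lam * sum(U**2).
            U = U + (k2 / sigma2) * np.outer(error, r) - k2 * lam * U
    return U

# Synthetic inputs generated from a hidden 3-dimensional cause (an assumption).
rng = np.random.default_rng(1)
U_true = 0.5 * rng.normal(size=(8, 3))
images = rng.normal(size=(200, 3)) @ U_true.T

U_init = np.random.default_rng(0).normal(scale=0.1, size=(8, 3))
U_learned = learn_U(images, n_causes=3)
err_before = mean_error(images, U_init)
err_after = mean_error(images, U_learned)
print(err_before, err_after)
```

Each weight change depends only on quantities available at the two ends of the connection (the residual error and the response), which is the sense in which the rule is local.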

##### Feedforward Perception as the Initial Inference Step in Predictive Coding

How does the traditional feedforward “bucket brigade” model of perception, where inputs are processed sequentially in one area and passed on to the next (e.g., LGN ➔ V1 ➔ V2 . . .), align with the hierarchical predictive coding view of cortical processing? The answer to this question is easy to obtain from Equation 3.1 by considering what happens in the very first time step $t=1$ when $\hat{r}_0 = 0$ and the two top-down prediction terms $f(U\hat{r}_0)$ and $r^{td}$ are also both $0$. In this case, if $g'(0)$ is also $0$, Equation 3.1 reduces to:

$\hat{r}_1 = \frac{k_1}{\sigma^2} U^\top \frac{\partial f^\top}{\partial x} I$

Thus, the first feedforward pass through the network multiplies the input $I$ with the feedforward weights $U^\top$ (besides the other multiplicative factors). Assuming this happens at all patches of an image, this equation describes exactly the type of operation implemented by a standard feedforward layer where the filters are given by the rows of $U^\top$. In other words, for a static input, if the top-down predictions are assumed to be zero, a hierarchical predictive coding network (e.g., figure 2C) initializes its estimates at all levels in the same manner as a deep neural network via a feedforward pass through all layers, before proceeding to further minimize prediction errors by generating top-down predictions from these initial estimates and refining them based on prediction errors.
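This reduction can be checked numerically. The sketch below assumes an elementwise $f = \tanh$ (so that $f(0) = 0$ and the derivative at zero is one), a Gaussian prior with $g'(r) = 2\alpha r$, and random weights and input; `pc_step` and its parameter values are hypothetical. Starting from a zero state with a zero top-down prediction, a single update equals a scaled feedforward filtering of the input by $U^\top$.

```python
import numpy as np

def pc_step(r, I, U, r_td, k1=0.1, sigma2=1.0, sigma_td2=1.0, alpha=0.1):
    """One discrete-time predictive coding update with f = tanh (elementwise)."""
    x = U @ r
    f_prime = 1.0 - np.tanh(x) ** 2              # diagonal of df/dx
    bottom_up = U.T @ (f_prime * (I - np.tanh(x)))
    top_down = r_td - r
    prior_grad = 2.0 * alpha * r                 # g'(r) for g(r) = alpha * ||r||^2
    return (r + (k1 / sigma2) * bottom_up
              + (k1 / sigma_td2) * top_down
              - (k1 / 2.0) * prior_grad)

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 3))
I = rng.normal(size=6)

r_first = pc_step(np.zeros(3), I, U, r_td=np.zeros(3))
feedforward = (0.1 / 1.0) * (U.T @ I)            # (k1 / sigma^2) * U^T I
print(np.allclose(r_first, feedforward))
```

Later updates differ from a pure feedforward pass because the prediction $f(U\hat{r})$ and the top-down term are no longer zero.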

#### Prediction in Time: Spatiotemporal Predictive Coding and Kalman Filtering

The model described thus far focused on learning and predicting static inputs. But the world is dynamic: most of the time, animals receive time-varying stimuli either due to their own movement or due to other moving objects in the environment. This makes the ability to predict future stimuli essential for survival (e.g., predicting the location of predators). The predictive coding framework can be extended to include temporal predictions (Rao, 1999; Rao & Ballard, 1998). Specifically, the network dynamics derived above for predictive coding implement a nonlinear and hierarchical form of Bayesian inference that can be related to the classic technique of Kalman filtering (Kalman, 1960). This relationship becomes clear when we augment the spatial generative model in Equation 1 with the ability to model the temporal dynamics of hidden state $r$ from time step $t$ to $t+1$:

$r_{t+1} = V_t r_t + m_t$ (5)

where $V_t$ is a (potentially time-varying) transition matrix and $m_t$ is zero mean Gaussian noise. Equation 5 models how a hidden state in the world, for example, the location of a predator, changes over time by assuming that the next state depends only on the current state (“Markov” assumption) plus some noise. Making the weights $V_t$ time-varying allows the equation to capture nonlinear transition dynamics.

Combining Equation 1 with Equation 5 and assuming the function $f$ is the identity function, one can derive the following equations for the network dynamics:

$\bar{r}_t = V_{t-1} \hat{r}_{t-1}$ (prediction)
$\hat{r}_t = \bar{r}_t + N_t U^\top G_t (I_t - U \bar{r}_t)$ (correction) (6)

where $N_t$ and $G_t$ are gain terms that depend on the (co-)variances of $m$ in Equation 5 and $n$ in Equation 1 (see Rao, 1999 for the derivation). The prediction equation takes the current estimate of the state and generates a prediction of the next state $\bar{r}_t$ via the matrix $V_t$. The correction equation corrects this prediction $\bar{r}_t$ by adding to it the prediction error $I_t - U\bar{r}_t$ weighted by gain terms $N_t$ and $G_t$, with the matrix $U^\top$ translating the error from the image space back to the more abstract state space of $r$. The gain terms $N_t$ and $G_t$ could potentially depend on task-dependent factors and can be regarded as “attentional modulation” of the prediction error (see section “Attention and Robust Predictive Coding”) (Rao, 1998). The above equations implement a Kalman filter (see Rao, 1999).

Figure 3 illustrates a neural network implementing the spatiotemporal predictive coding model given by Equation 6: the network uses local recurrent (lateral) connections $V$ to make a prediction $\bar{r}_t$ for the next time step, translates the prediction to the lower level as $U\bar{r}_t$ via feedback connections, conveys the prediction error $I_t - U\bar{r}_t$ via feedforward connections, and then corrects its state prediction $\bar{r}_t$ with prediction error weighted by the gain term $G$.
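The predict-and-correct cycle can be sketched with a small tracking problem. The code below is an illustration under explicit assumptions: it uses the standard information form of the Kalman filter, in which $G$ is taken to be the inverse covariance of the input noise and $N_t$ the posterior covariance of the state estimate. The source states only that the gains depend on the noise (co-)variances, so these particular expressions should be read as one conventional choice. The transition matrix is time-invariant, and the constant-velocity scenario, function name, and parameter values are hypothetical.

```python
import numpy as np

def predictive_coding_filter(inputs, U, V, state_noise_cov, input_noise_cov):
    """Alternate prediction via V and correction via the weighted prediction error."""
    n_states = V.shape[0]
    r_hat = np.zeros(n_states)
    N = np.eye(n_states)                          # covariance of the current estimate
    G = np.linalg.inv(input_noise_cov)            # precision of the input noise
    estimates = []
    for I_t in inputs:
        r_bar = V @ r_hat                                     # prediction
        M = V @ N @ V.T + state_noise_cov                     # predicted covariance
        N = np.linalg.inv(np.linalg.inv(M) + U.T @ G @ U)     # corrected covariance
        r_hat = r_bar + N @ U.T @ G @ (I_t - U @ r_bar)       # correction
        estimates.append(r_hat)
    return np.array(estimates)

# Hidden state: position and velocity; only a noisy position is observed.
rng = np.random.default_rng(0)
V = np.array([[1.0, 1.0], [0.0, 1.0]])
U = np.array([[1.0, 0.0]])
Q = 1e-3 * np.eye(2)                              # covariance of m in the dynamics
R = np.array([[1.0]])                             # covariance of n in the input

state = np.array([0.0, 0.5])
states, inputs = [], []
for _ in range(200):
    state = V @ state + rng.normal(scale=np.sqrt(1e-3), size=2)
    states.append(state)
    inputs.append(U @ state + rng.normal(scale=1.0, size=1))
states, inputs = np.array(states), np.array(inputs)

estimates = predictive_coding_filter(inputs, U, V, Q, R)
mse_filter = np.mean((estimates[:, 0] - states[:, 0]) ** 2)
mse_input = np.mean((inputs[:, 0] - states[:, 0]) ** 2)
print(mse_filter, mse_input)
```

The filtered position estimate tracks the true position more closely than the raw input does because each correction is tempered by the prediction from the previous time step.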

##### Prediction and Internal Simulation in the Absence of Inputs

The spatiotemporal predictive coding model allows for the possibility that the organism or agent might want to perform internal simulations of the dynamics of the external world (e.g., for planning) by predicting how future states evolve given a starting state (and possibly actions). This can be done by setting the input prediction error gain term $G_t$ in Equation 6 to zero (see also the relationship to attention below). This results in the following network dynamics for a single-level network:

$\bar{r}_t = V_{t-1} \hat{r}_{t-1}$
$\hat{r}_t = \bar{r}_t$

In this case, the network ignores any inputs and simply predicts future states moving forward in time using the learned state transition dynamics $V_t$. The network thus acts as a recurrent network, with a possibly time-varying set of recurrent weights $V_t$ to model nonlinear transitions.

For a hierarchical network, the network dynamics becomes (based on Equation 3.1):

$\hat{r}_t = (1 - \alpha) \bar{r}_t + \alpha r_t^{td}$

where $\alpha$ is the weight assigned to the prediction $r_t^{td}$ from the higher level. Here, the network combines a local recurrent prediction $\bar{r}_t$ at one level with a prediction $r_t^{td}$ from a higher level (using the weights $1 - \alpha$ and $\alpha$ respectively), allowing higher levels to guide the predictions at the lower levels during internal simulation, while ignoring external inputs.
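Internal simulation can be sketched as a rollout in which the input is never consulted. The code below is a minimal illustration assuming a fixed transition matrix and a constant higher-level prediction; the function name `simulate`, the constant-velocity transition matrix, and the numerical values are hypothetical.

```python
import numpy as np

def simulate(r0, V, n_steps, r_td=None, alpha=0.0):
    """Roll the state forward without inputs (input error gain set to zero).
    If a higher-level prediction r_td is given, blend it in with weight alpha."""
    r_hat = np.asarray(r0, dtype=float)
    trajectory = [r_hat]
    for _ in range(n_steps):
        r_bar = V @ r_hat                         # local recurrent prediction
        if r_td is None:
            r_hat = r_bar
        else:
            r_hat = (1.0 - alpha) * r_bar + alpha * np.asarray(r_td, dtype=float)
        trajectory.append(r_hat)
    return np.array(trajectory)

V = np.array([[1.0, 1.0], [0.0, 1.0]])            # position advances by velocity

free_run = simulate([0.0, 1.0], V, n_steps=3)
guided = simulate([0.0, 1.0], V, n_steps=2, r_td=[0.0, 0.0], alpha=0.5)
print(free_run)
print(guided)
```

In the free run the position simply advances by one unit per step, while in the guided run the higher-level prediction pulls the state back toward the origin at every step.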

##### Attention and Robust Predictive Coding

The Rao-Ballard predictive coding model can be extended to model top-down attention using a robust optimization function as first proposed in Rao (1998). Specifically, instead of using the squared error loss function:

$H_1 = \frac{1}{\sigma^2} \|I - f(Ur)\|_2^2$

the robust predictive coding model uses:

$H_R = \rho(I - f(Ur))$

where $\rho$ is a function that reduces the influence of outliers (large prediction errors) in the estimation of $r$. As an example, $\rho$ could be defined in terms of a diagonal matrix $S$ as follows (Rao, 1998):

$\rho(I - f(Ur)) = (I - f(Ur))^\top S (I - f(Ur))$

where the diagonal entries $S_{i,i}$ determine the weight accorded to the prediction error at input location $i$: $(I_i - f(u_i r))^2$ where $u_i$ denotes the $i$th row of $U$ ($u_i$ here is a row vector). A simple but attractive choice for these weights is the nonlinear function given by:

$S_{i,i} = \min\left\{1, \frac{c}{(I_i - f(u_i r))^2}\right\}$

where $c$ is a threshold parameter. This function has the following desirable effect: $S$ clips the squared prediction error for the $i$th input location to a constant value $c$ if $(I_i - f(u_i r))^2$ exceeds the threshold $c$.

Minimizing the robust optimization function $H_R$ leads to the following equation for robust predictive coding:

$\hat{r}_t = \hat{r}_{t-1} + k_1 U^\top \frac{\partial f^\top}{\partial x} G_t (I - f(U\hat{r}_{t-1}))$

where $G_t$ is a diagonal matrix whose diagonal entries at time instant $t$ are given by: $G_{i,i} = 0$ if $(I_i - f(u_i \hat{r}_{t-1}))^2 > c_t$ and $1$ otherwise. Here $c_t$ is a potentially time-varying threshold on the squared prediction error.

The gain $G_t$ acts as an “attentional filter” for outlier detection and filtering, allowing the predictive coding network estimating $r$ (figure 4, left panel) to suppress large prediction errors in parts of the input containing outliers. This enables the network to focus on verifying the feasibility of its current best hypothesis by trying to minimize prediction errors while ignoring outliers. Robust predictive coding thus allows the network to “focus its attention” on one object while ignoring occluders and background objects, and even “switch attention” from one object to another (figure 4, right panel) (see Rao, 1998, 1999).
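The gating idea can be sketched with a toy input in which one location is corrupted. The code below is an illustration under several assumptions: a single level, identity $f$, no prior on $r$, a fixed threshold, and a plain gradient step with gated errors as the update; it is not claimed to reproduce the exact update used in the cited work. The names `clip_weights`, `gate`, and `robust_infer` and all numerical values are hypothetical.

```python
import numpy as np

def clip_weights(errors, c):
    """Weights min(1, c / e_i^2): small errors are untouched, large ones are clipped."""
    squared = errors ** 2
    return np.minimum(1.0, c / np.maximum(squared, 1e-12))

def gate(errors, c):
    """Binary gain: 0 where the squared error exceeds c, 1 otherwise."""
    return (errors ** 2 <= c).astype(float)

def robust_infer(I, U, c=25.0, k1=0.02, n_steps=1000):
    """Gradient steps on r in which gated-out locations contribute no error."""
    r_hat = np.zeros(U.shape[1])
    for _ in range(n_steps):
        error = I - U @ r_hat
        r_hat = r_hat + k1 * (U.T @ (gate(error, c) * error))
    return r_hat

# The weighted squared error never exceeds the threshold c.
example_errors = np.array([1.0, 10.0])
print(clip_weights(example_errors, c=25.0) * example_errors ** 2)

rng = np.random.default_rng(0)
U = rng.normal(size=(20, 3))
r_true = np.array([1.0, -2.0, 0.5])
I = U @ r_true
I[7] += 50.0                                      # one occluded / outlier location

r_robust = robust_infer(I, U)
r_plain = np.linalg.lstsq(U, I, rcond=None)[0]    # ordinary least squares
print(np.round(r_robust, 3))
print(np.round(r_plain, 3))
```

The least-squares estimate is pulled away from the true causes by the corrupted location, whereas the gated estimate ignores that location and fits the rest of the input.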

##### What-Where Predictive Coding Networks and Equivariance

The predictive coding models above do not consider the fact that many natural inputs, such as videos, are generated by the same object or feature undergoing specific transformations such as translations, rotations, and scaling. The predictive coding model has been extended to account for such transformations using “What-Where” predictive coding (Rao & Ballard, 1998) and related models that learn transformations based on Lie groups (Miao & Rao, 2007; Rao & Ruderman, 1998) and bilinear models (Grimes & Rao, 2005).

The What-Where predictive coding model is shown in figure 5. It employs two networks to explain a new input $I(x)$: one network, called the “What” network, is similar to the original predictive coding network discussed above and estimates the features or object present in the image via the state vector $r$; the other network, called the “Where” network, estimates the transformation $x$ in the new input relative to a previous (canonical) input $I(0)$. The network architecture and the dynamics of how $r$ and $x$ are updated are both derived from first principles through prediction error minimization (Rao & Ballard, 1998).

The What-Where predictive coding network was one of the first neural networks to demonstrate equivariance: the representation of an object in the “What” network remains stable and invariant by virtue of having a second network, the “Where” network, which absorbs changes in the input stream by modeling these changes as transformations of a canonical representation (Rao & Ballard, 1998) (cf. the more recent line of research on equivariance using “capsule” networks (Hinton et al., 2011; Kosiorek et al., 2019; Sabour et al., 2017)). The What-Where predictive coding model contrasts with traditional deep neural networks which utilize pooling in successive layers to achieve invariance to transformations but at the cost of losing information about the transformations themselves.

While its architecture is derived from the principle of prediction error minimization, the What-Where predictive coding model shares similarities with the ventral-dorsal visual processing pathways in the primate visual cortex, where ventral cortical areas have been implicated in object-related processing (“What”) and dorsal cortical areas have been implicated in motion- and spatial-transformation-related processing (“Where”).

##### Predictive Coding and the Free Energy Principle

Predictive coding and the principle of prediction error minimization are closely related to variational inference and learning, which form the basis for VAEs in machine learning research (Dayan et al., 1995; Kingma & Welling, 2014) as well as the free energy principle in neuroscience as proposed by Friston and colleagues (Friston, 2005, 2010; Friston & Kiebel, 2009). This relationship is briefly summarized below.

MAP inference, as employed in the predictive coding model above, finds an estimate $r$ that maximizes the posterior distribution $prI$. Variational inference aims to find the full posterior distribution instead of a point estimate. Applying Bayes’ rule:

$Display mathematics$

The normalizing factor (denominator) contains multidimensional integrals that are usually intractable to compute (e.g., if $pr$ is a sparsity-inducing Laplace distribution in sparse coding). Due to this intractability, variational inference approximates the posterior as follows: the true posterior probability distribution $pθ$ parameterized by parameters $θ$ is approximated with a more tractable distribution $qφ$ parameterized by parameters $φ$. The “error” between the two distributions is quantified using the Kullback-Leibler (KL) divergence between the posterior probabilities of the latent variable $r$ given the input data $I$:

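$$\mathrm{KL}\big(q_\phi(r \mid I)\,\|\,p_\theta(r \mid I)\big) = \mathbb{E}_{q_\phi}\big[\log q_\phi(r \mid I) - \log p_\theta(r \mid I)\big] = \mathbb{E}_{q_\phi}\big[\log q_\phi(r \mid I) - \log p_\theta(I, r)\big] + \log p_\theta(I) = F + \log p_\theta(I) \tag{7}$$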

where $F$ is called the “variational free energy” and $\log p_\theta(I)$ is called the data log likelihood (given model parameters $\theta$) or model evidence. Note that variational free energy $F$ should not be confused with the physical notion of free energy (e.g., in thermodynamics), although there is a similarity in their definitions.

Rewriting Equation 7, we have:

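$$\log p_\theta(I) = \mathrm{KL}\big(q_\phi(r \mid I)\,\|\,p_\theta(r \mid I)\big) - F = \mathrm{KL}\big(q_\phi(r \mid I)\,\|\,p_\theta(r \mid I)\big) + L$$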

where $L = -F$ is called the evidence lower bound (or ELBO) in the variational learning and VAE literature since $\log p_\theta(I) \geq L$ (the KL divergence is nonnegative). It can be seen that an organism or artificial agent can increase model evidence (data log likelihood) by maximizing the ELBO $L$ or, equivalently, minimizing variational free energy $F$ with respect to the latent state and parameters. Note that since $F = \mathrm{KL}\big(q_\phi(r \mid I)\,\|\,p_\theta(r \mid I)\big) - \log p_\theta(I)$ and $\log p_\theta(I)$ does not depend on $r$ or $\phi$, maximizing the ELBO (minimizing $F$) with respect to $r$ and $\phi$ is equivalent to minimizing the KL divergence between the approximating tractable distribution $q$ and the true distribution $p$.
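The identity $\log p_\theta(I) = \mathrm{KL} + L$ can be checked numerically on a small example. The sketch below is only an illustration: it assumes a hypothetical latent variable with three discrete values and made-up prior and likelihood numbers, so that the normalizing factor is a simple sum rather than an intractable integral.

```python
import numpy as np

# Hypothetical toy model: a latent r with 3 discrete values and one observed input I.
prior = np.array([0.5, 0.3, 0.2])          # p(r)
likelihood = np.array([0.10, 0.60, 0.30])  # p(I | r) for the observed I

joint = prior * likelihood                 # p(I, r)
evidence = joint.sum()                     # p(I), the normalizing factor
posterior = joint / evidence               # true posterior p(r | I)

def free_energy(q):
    """Variational free energy F = E_q[log q(r) - log p(I, r)]."""
    return float(np.sum(q * (np.log(q) - np.log(joint))))

def kl(q, p):
    """KL divergence between two discrete distributions."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

q = np.array([0.2, 0.5, 0.3])              # an arbitrary approximating distribution
F = free_energy(q)
elbo = -F
log_evidence = float(np.log(evidence))

print(f"log p(I)   = {log_evidence:.4f}")
print(f"KL + ELBO  = {kl(q, posterior) + elbo:.4f}")
print(f"-F at q = true posterior: {-free_energy(posterior):.4f}")
```

For any choice of `q`, the ELBO stays below the log evidence, and the gap closes only when `q` equals the true posterior.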

To make the connection to predictive coding, the definition of variational free energy $F$ used in Equation 7 can be rewritten as follows:

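$$F = \mathbb{E}_{q_\phi}\big[\log q_\phi(r \mid I) - \log p_\theta(I, r)\big] = \mathbb{E}_{q_\phi}\big[\log q_\phi(r \mid I) - \log p_\theta(r)\big] - \mathbb{E}_{q_\phi}\big[\log p_\theta(I \mid r)\big] = \mathrm{KL}\big(q_\phi(r \mid I)\,\|\,p_\theta(r)\big) - \mathbb{E}_{q_\phi}\big[\log p_\theta(I \mid r)\big]$$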

Using the relationship in Equation 2 for the logarithm of $p_\theta(I \mid r)$ and using $\alpha$ as the constant of proportionality for Equation 2, the free energy for the predictive coding model is given by:

$Display mathematics$

Thus, within the predictive coding framework, minimizing the variational free energy $F$, as advocated by the free energy principle of brain function (Bogacz, 2017; Friston, 2010), is equivalent to finding an approximating posterior distribution $q_\phi$ that minimizes prediction errors while also staying close to the prior for $r$. This can be regarded as a full-distribution version of the predictive coding model described above, which uses MAP inference to find an optimal point estimate that minimizes prediction errors while also being constrained by the negative logarithm of the prior (Equation 2.1).
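As a worked illustration of this step, the expansion below makes one explicit assumption that is not taken from Equation 2 itself (whose exact form may include a nonlinearity or a noise variance): the log likelihood is a scaled squared prediction error around the linear prediction $Ur$, with $\alpha$ the constant of proportionality and $c$ a constant that does not depend on $r$.

```latex
\begin{aligned}
\log p_\theta(I \mid r) &= -\alpha \,\lVert I - Ur \rVert^2 + c
  && \text{(assumed form)} \\
F &= \mathrm{KL}\big(q_\phi(r \mid I)\,\|\,p_\theta(r)\big)
     - \mathbb{E}_{q_\phi}\big[\log p_\theta(I \mid r)\big] \\
  &= \underbrace{\mathrm{KL}\big(q_\phi(r \mid I)\,\|\,p_\theta(r)\big)}_{\text{closeness to the prior}}
     + \alpha\,\underbrace{\mathbb{E}_{q_\phi}\big[\lVert I - Ur \rVert^2\big]}_{\text{expected prediction error}}
     - c
\end{aligned}
```

Under this assumption, the two terms correspond directly to the two pressures described in the text: the KL term keeps $q_\phi$ near the prior on $r$, and the expectation term penalizes prediction errors averaged over $q_\phi$.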

##### Action-Based Predictive Coding and Active Inference

Prediction error can be minimized not only by estimating optimal hidden states $r$ (perception) and learning optimal synaptic weights $U$ and $V$ (internal model learning) but also by choosing appropriate actions. Inferring actions that minimize prediction error with respect to a goal, or more generally, a prior distribution over future states, is called active inference (Fountas et al., 2020; Friston et al., 2011, 2017). For example, in a navigation task, if the objective is to reach a desired goal location by passing through a series of landmarks, prediction error with respect to the goal and landmarks can be minimized by selecting actions at each time step that reach each landmark and eventually the goal location. Active inference can be regarded as an example of “planning by inference” where an internal model is used to perform Bayesian inference of actions that maximize expected reward or the probability of reaching a goal state (Attias, 2003; Botvinick & Toussaint, 2012; Verma & Rao, 2005, 2006).
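The navigation example can be made concrete with a deliberately simplified sketch. It assumes a hypothetical one-dimensional track, a known internal model in which the next position equals the current position plus the action, and greedy one-step action selection. This is a point-estimate caricature of action selection by prediction error minimization, not the full Bayesian treatment used in the active inference literature.

```python
def predict(state, action):
    """Internal model (assumed known here): next position = position + action."""
    return state + action

def select_action(state, target, actions=(-1, 0, 1)):
    """Pick the action whose predicted outcome has the smallest squared error to the target."""
    return min(actions, key=lambda a: (target - predict(state, a)) ** 2)

def navigate(start, targets, max_steps=100):
    """Visit each landmark in order; the last entry of targets is the goal."""
    state, path = start, [start]
    for target in targets:
        while state != target and len(path) <= max_steps:
            # For simplicity the world is assumed to behave exactly like the internal model.
            state = predict(state, select_action(state, target))
            path.append(state)
    return path

path = navigate(start=0, targets=[3, 1, 6])   # two landmarks, then the goal at position 6
print(path)
```

Each step reduces the error between the predicted next position and the current landmark, so the path passes through the landmarks in order and ends at the goal.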

Predictive coding allows internal models for action inference to be learned by predicting the sensory consequences of an executed action. For example, babies, even in the womb, make seemingly random movements called “body babbling” (Rao et al., 2007) that can allow a predictive coding network to learn a mapping between the current action and the sensory input received immediately after. After learning such an action-based prediction model via prediction error minimization, the model can be unrolled in time into the future to specify a desired goal state (or states) (see, e.g., Verma & Rao, 2005, 2006), and predictive coding-based inference can be used to infer a set of current and future actions most likely to lead to the goal state(s). Some of the empirical evidence reviewed in the section “Empirical Evidence for Predictive Coding” on visual and auditory predictions based on motor activity can be understood within the framework of action-based predictive coding.
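The learning stage described here can be sketched as follows. The example assumes a hypothetical “body” in which the sensory consequence of a two-dimensional action is a fixed linear map plus a little noise, and it uses a simple delta rule driven by the sensory prediction error. The matrix `true_W`, the learning rate, and the noise level are illustrative choices; the sketch covers only learning of the action-to-sensation model, not the later unrolling in time or the inference of actions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "body": sensation = true_W @ action + noise. The learner never sees true_W.
true_W = np.array([[1.0, -0.5],
                   [0.3,  0.8]])

W = np.zeros((2, 2))      # learned action-to-sensation weights
lr = 0.05                 # learning rate (arbitrary choice)
errors = []

for step in range(2000):
    action = rng.uniform(-1.0, 1.0, size=2)                     # random "babbling" movement
    sensation = true_W @ action + 0.01 * rng.standard_normal(2)
    error = sensation - W @ action                              # sensory prediction error
    W += lr * np.outer(error, action)                           # delta rule: reduce squared error
    errors.append(float(error @ error))

print("mean squared error, first 100 steps:", np.mean(errors[:100]))
print("mean squared error, last 100 steps: ", np.mean(errors[-100:]))
print("learned weights:\n", np.round(W, 2))
```

As the prediction errors shrink, the learned weights approach the mapping that generated the sensations, which is the sense in which random movements suffice to learn an action-based prediction model.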

### Predictive Coding in the Visual System

#### Predictive Coding in Early Stages of Visual Processing

Early “predictive coding” models focused on explaining the center–surround response properties and biphasic temporal antagonism of cells in the retina (Atick, 1992; Buchsbaum et al., 1983; Meister & Berry, 1999; Srinivasan et al., 1982) and lateral geniculate nucleus (LGN) (Dong & Atick, 1995a; Dan et al., 1996). These models were derived from the information-theoretic principle of efficient coding (Attneave, 1954; Barlow, 2012; Simoncelli & Olshausen, 2001) rather than hierarchical generative models like the Rao-Ballard model. Under the efficient coding hypothesis, the goal of the visual system is to efficiently represent visual information by reducing redundancy arising from natural scene statistics (Dong & Atick, 1995b; Field, 1987; Ruderman & Bialek, 1994). A simple example of redundancy reduction is to remove aspects of an input that are predictable from nearby inputs. Neural activities then only need to represent information that deviates from the prediction.

Srinivasan et al. (1982) proposed that the spatial and temporal receptive field properties of retinal ganglion cells are a result of predicting local intensity values in natural images from a linear weighted sum of nearby values in space or preceding input values in time. Training a linear system that predicts the pixel intensity at a location from its surrounding pixels produces prediction weights that closely resemble the receptive fields of retinal ganglion cells (Huang & Rao, 2011; Srinivasan et al., 1982). Thus, the neural activities of retinal ganglion cells can be seen as representing the “whitened” residual errors that the system cannot predict. Srinivasan et al. also showed that the linear predictor weights depend on the signal-to-noise ratio (SNR) of visual scenes. Larger groups of neighboring regions need to be integrated in order to cancel out high statistical noise in low-SNR input, a phenomenon observed by the authors in the fly eye. More recently, Hosoya et al. (2005) showed that retinal ganglion cells can rapidly adapt to environments with changing correlation structure and become more sensitive to novel stimuli, consistent with the predictive coding view of the retina.
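The linear prediction idea can be illustrated with a small numerical sketch. It does not use natural images: as a stand-in, it assumes a one-dimensional first-order autoregressive signal whose neighboring “pixels” are strongly correlated, and it fits the predictor by ordinary least squares. The correlation value and the neighborhood size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one row of a natural image: a first-order autoregressive signal
# whose neighboring "pixels" are strongly correlated (rho is an arbitrary choice).
rho, n = 0.9, 20000
x = np.zeros(n)
for i in range(1, n):
    x[i] = rho * x[i - 1] + rng.standard_normal()

# Predict each pixel from its two neighbors on either side.
offsets = [-2, -1, 1, 2]
centers = np.arange(2, n - 2)
neighbors = np.stack([x[centers + k] for k in offsets], axis=1)
target = x[centers]

weights, *_ = np.linalg.lstsq(neighbors, target, rcond=None)
residual = target - neighbors @ weights

# Equivalent receptive field of a unit signalling the residual:
# excitatory center, inhibitory surround.
receptive_field = np.array([-weights[0], -weights[1], 1.0, -weights[2], -weights[3]])

print("predictor weights:", np.round(weights, 3))
print("receptive field:  ", np.round(receptive_field, 3))
print("signal variance:  ", round(float(target.var()), 3))
print("residual variance:", round(float(residual.var()), 3))
```

A unit that transmits only the residual has a positive center weight and negative weights on the nearest neighbors, and its output varies far less than the raw signal, which is the redundancy reduction the efficient coding account describes.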

Similar ideas have been used to cast LGN processing as performing temporal whitening of inputs from the retina (Atick, 1992; Dan et al., 1996; Dong & Atick, 1995a; Kaplan et al., 1993). Dong and Atick (1995a) derived a linear model whose objective is to produce decorrelated output in the frequency domain. The optimized spatiotemporal filter compares remarkably well with the physiological data from the LGN (Saul & Humphrey, 1990). Dan et al. (1996) confirmed through experiments that the output from the LGN is temporally decorrelated (especially at lower frequencies, 3–15 Hz) for natural stimuli but not for white noise, suggesting that the LGN selectively whitens stimuli that match natural scene statistics. In summary, these results suggest that the early stages of visual processing (the retina and LGN) are tuned to the statistical properties of the natural environment. The same insight, implemented via a hierarchical generative model, forms the core of the Rao-Ballard predictive coding model of the visual cortex.
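Temporal whitening in the frequency domain can be sketched in a few lines. The example assumes an input whose amplitude spectrum falls off as $1/f$ (a common approximation for natural signals, used here without fitting to data) and applies a filter whose gain is the inverse of that spectrum. It omits the high-frequency noise suppression that a realistic filter would need, so it should be read as an idealized illustration rather than the Dong and Atick model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096
freqs = np.fft.rfftfreq(n)          # temporal frequencies in cycles per sample

# Build an input whose amplitude spectrum falls off as 1/f (assumed, not fitted to data).
amplitude = np.zeros_like(freqs)
amplitude[1:] = 1.0 / freqs[1:]     # leave the DC component at zero
white = np.fft.rfft(rng.standard_normal(n))
signal = np.fft.irfft(white * amplitude, n)

# Whitening filter: gain equal to the inverse of the input amplitude spectrum.
gain = np.zeros_like(freqs)
gain[1:] = freqs[1:]
output = np.fft.irfft(np.fft.rfft(signal) * gain, n)

def lag1_autocorrelation(x):
    """Correlation between a signal and a copy of itself shifted by one sample."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

print("lag-1 autocorrelation, input: ", round(lag1_autocorrelation(signal), 3))
print("lag-1 autocorrelation, output:", round(lag1_autocorrelation(output), 3))
```

The input is strongly correlated from one time step to the next, whereas the filtered output is nearly uncorrelated. Applying the same filter to white noise would instead introduce correlations, in line with the stimulus dependence reported by Dan et al. (1996).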

#### Predictive Coding in the Visual Cortex

The model presented in “Hierarchical Predictive Coding” was used by Rao and Ballard to explain both classical and extra-classical receptive field effects in the visual cortex in terms of prediction error minimization. The cortex is modeled as a hierarchical network in which higher-level neurons predict the neural activities of lower-level neurons via feedback connections (figure 2A, lower arrows). A class of lower-level neurons, known as “error neurons,” compute the differences between the predictions from the higher level and the actual responses at the lower level, and convey these prediction errors back to the higher level via feedforward connections (figure 2A, upper arrows). Except for neurons at the highest level, neural activities at every level are influenced by both “top-down” predictions and “bottom-up” prediction errors (figure 2B). Additionally, the network is structured such that the higher-level neurons make predictions at a larger spatial scale than lower-level neurons; this is achieved by allowing higher-level neurons to predict the responses of several lower-level modules, resulting in a combined receptive field larger than any single lower-level neuron’s receptive field (e.g., in figure 2C, a single Level 2 module predicts the responses of three Level 1 modules).

The dynamics of the recurrent neural network implementing predictive coding are governed by Equation 3 and the synaptic weights are learned using Equation 4. When trained on natural image patches (figure 6, top panel), the synaptic weights that were learned in the first level resembled oriented spatial filters or Gabor wavelets similar to the receptive fields of simple cells in V1 while at the second level, the synaptic weights resembled more complex features that appear to be combinations of several lower-level filters (figure 6, Level 2).
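The interplay between fast inference dynamics and slow weight learning can be sketched for a single level. The code below is a simplified stand-in and may differ in detail from Equations 3 and 4: it assumes a linear generative model (the input is predicted as $Ur$), Gaussian-like priors that appear as decay terms, and synthetic data in place of natural image patches. All step sizes and prior strengths are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_causes, n_images = 16, 4, 200

# Synthetic "images" generated from a hidden linear basis (a stand-in for image patches).
true_U = rng.standard_normal((n_pixels, n_causes))
images = (true_U @ rng.standard_normal((n_causes, n_images))).T

U = 0.1 * rng.standard_normal((n_pixels, n_causes))    # feedback (prediction) weights
k_r, k_U, alpha, lam = 0.02, 0.005, 0.01, 0.001        # step sizes and prior strengths

def infer(I, U, n_steps=50):
    """Relax the state estimate r by gradient descent on prediction error plus a prior term."""
    r = np.zeros(U.shape[1])
    for _ in range(n_steps):
        error = I - U @ r                       # bottom-up prediction error
        r += k_r * (U.T @ error - alpha * r)    # error correction plus decay toward the prior
    return r

def mean_error(U):
    """Average squared prediction error over all images after inference."""
    return float(np.mean([np.sum((I - U @ infer(I, U)) ** 2) for I in images]))

error_before = mean_error(U)
for epoch in range(20):
    for I in images:
        r = infer(I, U)
        error = I - U @ r
        U += k_U * (np.outer(error, r) - lam * U)   # Hebbian-like update driven by the error
error_after = mean_error(U)

print("mean squared prediction error before learning:", round(error_before, 3))
print("mean squared prediction error after learning: ", round(error_after, 3))
```

Inference adjusts only the state estimate for each input, while learning slowly reshapes the weights so that future predictions leave smaller residual errors.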

#### Endstopping and Contextual Effects as Prediction Error Minimization

Some visual cortical neurons (particularly those in layers 2/3) exhibit the curious property that a strong response to a stimulus gets suppressed when a stimulus is introduced in the surrounding region whose properties (e.g., orientation) match the properties of the stimulus at the center of the receptive field (RF). Such effects, which have been reported in several cortical areas (Bolz & Gilbert, 1986; Desimone & Schein, 1987; Hubel & Wiesel, 1968), are known as “extra-classical” receptive field effects or contextual modulation. Hubel and Wiesel named one class of such cells in area V1 “hypercomplex” cells and noted that these cells exhibit the property of “endstopping”: The cell’s response is inhibited or eliminated when an oriented bar stimulus in the center of the cell’s RF is extended beyond its RF to the surrounding region.

Rao and Ballard (1999) suggested that endstopping and related contextual effects could be interpreted in terms of prediction errors in a network trained for predictive coding of natural images. The responses of neurons representing prediction errors (e.g., neurons in cortical layers 2/3 that send axons to a “higher” cortical area) are suppressed when the top-down prediction becomes more accurate because the larger stimulus (e.g., longer bar) engages higher-level neurons tuned to this stimulus. These neurons generate more accurate predictions for the lower level, resulting in low prediction errors. When the surrounding context is missing or at odds with the central stimulus, the prediction error responses are high due to the mismatch between the higher level’s prediction and the lower-level responses. Rao and Ballard proposed that the tendency for the higher level to expect similar statistics (e.g., similar orientation) for a central patch and its surrounding region arises from the statistics of natural images that exhibit such statistical regularities and the fact that the hierarchical predictive coding network has been trained as a generative model to emulate these statistics.

Figure 7 illustrates the prediction error responses from a two-level predictive coding network trained on natural images.

The error-detecting model neurons at the first level (with firing rates $r - r^{td}$) display endstopping similar to cortical neurons (figure 7B, solid curve): Model neuron responses are suppressed when the bar extends beyond the classical receptive field (figure 7A, solid curve) as the predictions from the higher level become progressively more accurate with longer bars. Elimination of predictive feedback causes the error-detecting neurons to continue to respond robustly to longer bars (figure 7A, dotted curve). The same model can also explain contextual effects (figure 7C): The first-level error-detecting neurons show greater responses (solid line) when the texture stimulus at the center is oriented orthogonally to the stimulus in the surround than when the center and surround share the same orientation (dashed line). Similar contextual effects have been reported in V1 neurons (Zipser et al., 1996). Other V1 response properties such as cross-orientation suppression and orientation contrast facilitation can also be explained by the predictive coding framework (Spratling, 2008, 2010).

In summary, the predictive coding model suggests that (a) the physiological properties of visual cortical neurons are a consequence of statistical learning of an internal model of the natural environment—specifically, the objective of prediction error minimization allows the cortex to learn a hierarchical generative model of the natural world; and (b) perception is the process of actively explaining input stimuli by inverting a learned internal generative model via inference to recover hidden causes of the input. Context effects such as endstopping arise as a natural consequence of the visual cortex detecting prediction errors or deviations from the expectations generated by a learned internal model of the natural environment.

#### A Common Misconception About the Predictive Coding Model

One of the most common misconceptions about the predictive coding model is that the model predicts suppression of all neural activity when stimuli become predictable. This has led some authors to state that experimental evidence showing neurons not being suppressed or maintaining persistent firing for predictable inputs contradicts the predictive coding model. On the contrary, the predictive coding model requires a group of neurons to maintain the internal representation (state estimate $\hat{r}$) at each hierarchical level for generating predictions for the lower level (see “Hierarchical Predictive Coding” and figure 8). Thus, in the predictive coding model, the neurons that are suppressed when stimuli become predictable are error-detecting neurons that are distinct from the neurons maintaining the network’s internal representation of the external world. Similar to the efficient coding models of the retina and LGN (Dong & Atick, 1995a; Srinivasan et al., 1982), redundancy reduction occurs primarily in the feedforward pathways of the Rao-Ballard predictive coding model, with the feedback pathways remaining active to convey predictions.

#### Neuroanatomical Implementation of Predictive Coding

Rao and Ballard (1999) postulated two groups of neurons at each hierarchical level with distinct computational goals (figure 8). One group of neurons maintains an internal representation (state estimate) for generating top-down predictions of lower-level activities. These neurons are hypothesized to be in the deep layers 5/6 of cortical columns and are predicted by the model to exhibit sustained activity to maintain predictions to lower levels. A different group of neurons at the same level calculates prediction errors to be conveyed to the next higher level. These were suggested to be layer 2/3 neurons which send connections to “higher” order cortical areas and which are expected to exhibit transient activity. Since prediction errors can be positive or negative, Rao and Ballard (1999) proposed two subclasses of error-detecting neurons, one subclass representing positive errors and another representing negative errors, similar to on-center off-surround and off-center on-surround neurons in the retina and LGN.

In general, as seen above in endstopping and other contextual effects, the model predicts that layer 2/3 neurons are suppressed when the stimuli are predictable (i.e., consistent with natural image statistics) while deeper layer neurons remain active. Stimuli that deviate from natural image statistics (“novel” stimuli) on the other hand elicit large responses in layer 2/3 neurons. The model also predicts that prediction error signals are used for unsupervised learning of the synaptic connections in the predictive coding network, driving the synaptic weights to better reflect the structure of the input stimuli.

### Empirical Evidence for Predictive Coding

Experimental evidence has been mounting for predictive processing in the cortex thanks to advances in neuronal recording and stimulation techniques such as optical imaging and optogenetics. Particularly relevant to the hierarchical predictive coding model proposed by Rao and Ballard (1999) are findings of top-down predictive “internal representation” neurons and bottom-up error-detecting neurons in a cortical column. These findings appear to suggest that the cortex may indeed be implementing a hierarchical generative model of the natural world. We briefly review the experimental evidence below.

#### Internal Representation Neurons and Prediction Error Neurons in the Cortex

The hierarchical predictive coding model predicts the existence of at least two functionally distinct classes of neurons in the cortex: internal state representation neurons $r$, which maintain the current estimate of state at a given hierarchical level and are postulated to reside in the deeper layers 5/6 of the cortex, and error-detecting neurons $r - r^{td}$ in layers 2/3, which compute the difference between the current state estimate and its top-down prediction from a higher level. Recent studies have provided evidence for both types of neurons in the cortex.

Keller and colleagues (2012) recorded neural activities from layer 2/3 cells in the monocular visual cortex of behaving mice that were head-fixed and running on a spherical treadmill. The mice were exposed to 10–30 minutes of visual feedback as they ran on the treadmill. In normal “feedback” trials, the visual flow stimuli provided to the mouse were full-field vertical gratings coupled to the mouse’s locomotion on the treadmill. In “mismatch” trials, visual-locomotion mismatches were delivered randomly as brief visual flow halts (1 second). As a control, the mice also went through “playback trials” in which visual flow was passively viewed without locomotion.

The authors found that 13.0% of the visual cortical neurons recorded responded predominately to feedback mismatches. Figure 9A shows a sample neuron (cell number 677) that responded mainly to mismatch trials (orange shading). Also, 23.6% of the neurons responded mainly to feedback trials in which visual flow feedback was predictable (cell number 452 in figure 9A). The mismatch responses were also significant in the population average (figure 9B) and the activity onset in mismatch trials was much stronger than that in the other trials. Furthermore, the mismatch signals encoded the degree of mismatch—a visual flow halt during faster locomotion resulted in a stronger response than during slow locomotion (figure 9C, darker lines denote faster speed at the time of visual flow halt).

V1 neurons have also been found to be predictive of spatial locations after adapting to a new environment. In an experiment by Fiser et al. (2016), mice went through a virtual tunnel with blocks of two different grating patterns (A or B) separated by distinct landmarks. The five trial conditions only differed in the fifth block, where the grating patterns A and B as well as omission with no visual stimuli had different probabilities of occurring (see figure 9D). After adaptation, some neurons developed predictive responses to specific visual stimuli based on spatial information. As shown in figure 9E, an example neuron (black trace) showed strong activation before the mouse perceived Block B (but not Block A). In contrast, another sample neuron (gray trace) showed activation after entering Block B (but not Block A). The authors also discovered prediction error responses similar to those reported by Keller et al. (2012). The population average of neural activities during omission trials was much greater than during A and B trials (figure 9F, left). Moreover, a subset of neurons (2.3%) developed omission-selectivity—they showed large responses only to the omission trials (figure 9F, right).

Other studies have also documented neural responses carrying predictive information. Xu and colleagues (2012) found that after rats adapted to a visual moving dot trajectory, a brief flash at the starting point of the same trajectory triggered the same sequential firing pattern in the rat’s V1 as evoked by the full-sequence stimulus.

Similarly, Gavornik and Bear (2014) discovered that after an animal is exposed to a sequence of stimuli during training, V1 regenerates the sequential response even when certain elements of the sequence are omitted.

Prediction and prediction error-like signals have also been found in cortical areas in the human visual cortex (e.g., Murray et al., 2002) and the hierarchical face processing region of the monkey inferior temporal cortex (IT) (Freiwald & Tsao, 2010; Tsao et al., 2006). Schwiedrzik and Freiwald (2017) exposed macaque monkeys to fixed pairs of face images with different head orientations and identities such that the successor face image can be predicted from the preceding face image. Neurons in the lower-level face area ML (middle lateral section of the superior temporal sulcus) displayed large responses when the pair association was violated (either in identity, or head orientation, or both). Furthermore, prediction errors resulting from view violation (head orientation) diminished and eventually vanished during the late phase of responses while those resulting from identity violation remained significant. This is consistent with the interpretation that the top-down predictive signals from the view-invariant neurons in higher-level anterior lateral and anterior medial areas suppress the view mismatch responses (encoded locally in the lower-level ML area), while identity-related mismatch signals are propagated through feedforward circuits for further processing. In another study, Issa and colleagues (2018) used different face-part configuration stimuli (typical versus atypical) and found that the lower-level areas of the hierarchy (posterior IT and central IT) signal deviations of their preferred features from the expected configurations, whereas the top level (anterior IT) maintained a preference for natural, frontal face-part configurations. The authors further discovered that the early responses in central IT and anterior IT are correlated with late responses in posterior IT: Images that produced large early responses in higher-level areas were followed by reduced activity in lower-level areas, consistent with top-down prediction signals suppressing lower-level responses.

In another experiment, Choi et al. (2018) showed that a hierarchical inference model could explain the effect of feedback signals from the prefrontal cortex to intermediate visual cortex V4 as top-down predictions of partially occluded shapes.

Schneider et al. (2018) explored the effects of learning on prediction error-like activity in the primary auditory cortex. Rats were given artificial auditory feedback coupled to their locomotion: The pitch of the sound was proportional to the rat’s running speed. They found that a group of neurons in the rat’s primary auditory cortex initially responded strongly to the artificial auditory feedback (“reafferent sound”) but over the course of several days, the neuronal circuits learned to suppress this activity. The suppression occurred whenever the reafferent sound was coupled to the rat’s locomotion and did not occur when a nonreafferent sound was played or when the reafferent sound was played during resting. The gradual suppression of responses is consistent with how the predictive coding model learns an internal model of the environment: as the network learns to predict the artificial sound coupled to the rat’s locomotion, the predictions get better, resulting in decreasing prediction errors which manifest as suppression of the auditory neurons’ activities.

The results discussed above provide evidence for predictive neural activity and prediction error-like responses in the cortex. The Rao and Ballard model additionally postulates that layer 2/3 neurons compute and convey the prediction errors while neurons in the deeper layers 5/6 maintain the state estimate. Recent experiments have attempted to test these predictions. While it is hard to distinguish the state estimating “internal representation” neurons from those driven by bottom-up sensory stimuli (see review by Keller & Mrsic-Flogel, 2018 for further discussions), there is a growing body of evidence suggesting that layer 2/3 neurons may indeed play a role in comparing bottom-up information and top-down predictions.

#### Layer 2/3 Neurons as Top-Down Bottom-Up Signal Comparators

For biological networks to use prediction errors to correct their estimate, both positive and negative errors need to be represented. At any input location, a positive prediction error ($(I - Ur) > 0$) occurs when the input is not predicted (or incorrectly predicted) while a negative prediction error ($(I - Ur) < 0$) occurs when a predicted input is omitted. Rao and Ballard (1999) postulated that layer 2/3 in the cortex may employ two different groups of neurons, one to convey positive errors and another for negative errors, similar to on-center, off-surround and off-center, on-surround ganglion cells in the retina (Srinivasan et al., 1982).
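Because firing rates cannot be negative, one way to picture this proposal is as a pair of rectified channels. The sketch below is only an illustration of that idea: the function name `split_error` and the example numbers are hypothetical, and the arrays stand in for the input $I$ and the prediction $Ur$.

```python
import numpy as np

def split_error(actual, predicted):
    """Encode a signed prediction error with two non-negative firing rates."""
    error = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    positive = np.maximum(error, 0.0)    # input stronger than predicted
    negative = np.maximum(-error, 0.0)   # predicted input weaker than expected or missing
    return positive, negative

actual = np.array([1.0, 0.0, 0.5])       # I: what arrived
predicted = np.array([0.2, 0.7, 0.5])    # U r: what was expected
positive, negative = split_error(actual, predicted)

print("positive-error channel:", positive)
print("negative-error channel:", negative)
print("recovered signed error:", positive - negative)
```

At each location at most one of the two channels is active, and their difference recovers the signed error, so no information is lost by the split.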

To test this theory, Jordan and Keller (2020) used an experimental setup similar to the one used in Keller et al. (2012): mice ran on a treadmill with locomotion-coupled visual flow feedback. Whole-cell recordings were obtained from both layer 2/3 and layer 5/6 neurons in V1. Visual feedback could be interrupted with a brief flow halt (1 second) at random times to generate visual-locomotion mismatch events. Out of 32 neurons recorded in layer 2/3, 17 neurons showed depolarizing activities (figure 10A, left, depolarizing mismatch (dMM) neurons) and 6 neurons showed hyperpolarizing activities (figure 10A, right, hyperpolarizing mismatch (hMM) neurons) during mismatch trials. These results suggest that the dMM and hMM neurons in layer 2/3 may subserve the function of encoding positive and negative prediction errors.

In addition, 30% of the neurons exhibited significant correlations between the mismatch responses and the speed of locomotion (visual halts that occurred during faster locomotion generated “stronger” mismatch signals). The sign of the correlation was also different between dMM and hMM neurons, with dMM neurons showing a positive correlation (figure 10B, left) and hMM neurons showing a negative correlation (figure 10B, right). These results are consistent with the calcium imaging study discussed previously (Keller et al., 2012; figure 9C), showing that the responses of layer 2/3 neurons could potentially signal the quantitative level of prediction errors.

Jordan and Keller (2020) also investigated the differences between the responses of layer 2/3 neurons and deeper layer 5/6 neurons during normal visual feedback trials and mismatch trials. A much lower ratio of neurons in layers 5/6 (5 out of 14) responded predominately to mismatch trials. Additionally, larger activity during mismatch trials was rare (1 neuron), with 7 neurons exhibiting reduced activities. The difference in responses between superficial and deep layer neurons was significant in mismatch trials (figure 10C, right) but not in normal visual flow trials (figure 10C, left). To further characterize the influence of visual flow and locomotion on layer 2/3 neurons versus layer 5/6 neurons, correlations between the activities of these neurons and locomotion speed or visual flow speed were calculated. As seen in figure 10D (left plot), the distribution of correlations in layer 2/3 was bimodal: Activities of most dMM neurons were positively correlated with locomotion speed and negatively correlated with visual flow speed (and vice versa for hMM neurons). On the other hand, activities of layer 5/6 neurons were mostly positively correlated with both locomotion speed and visual flow speed (figure 10D, right). These results suggest that layer 2/3 neurons are well-suited to computing the error between the locomotion-generated predictions of visual inputs and the actual visual input, whereas the deeper layer 5/6 neurons may integrate top-down predictions (here, from motor areas) and bottom-up input to compute an estimate of the state $r$ at the current hierarchical level.
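The pattern of correlation signs follows directly from a comparator arrangement, as a toy calculation shows. The sketch below is not fitted to the recordings: it assumes hypothetical membrane-potential-like responses that are linear in the two speeds, and it assumes that locomotion and visual flow speeds vary independently so that their separate influences can be estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Assumed open-loop condition: locomotion and visual flow speeds vary independently.
locomotion = rng.uniform(0.0, 1.0, n)
visual_flow = rng.uniform(0.0, 1.0, n)

# Hypothetical response models (illustrative only, not fitted to data).
dmm = locomotion - visual_flow            # depolarizes when movement exceeds flow
hmm = visual_flow - locomotion            # hyperpolarizes when movement exceeds flow
deep = 0.5 * (locomotion + visual_flow)   # integrates both signals

def corr(a, b):
    """Pearson correlation coefficient between two signals."""
    return float(np.corrcoef(a, b)[0, 1])

for name, response in [("dMM-like", dmm), ("hMM-like", hmm), ("layer 5/6-like", deep)]:
    print(f"{name:15s} corr with locomotion: {corr(response, locomotion):+.2f}, "
          f"with visual flow: {corr(response, visual_flow):+.2f}")
```

Units that subtract one signal from the other show correlations of opposite sign with the two speeds, whereas a unit that sums them correlates positively with both, mirroring the qualitative difference between the two layers described above.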

The difference in neural responses to expected versus unexpected visual flows in layer 2/3 versus layer 5 was also confirmed in a recent study by Gillon et al. (2021). The authors used an open-loop experiment (no sensorimotor coupling to locomotion) with stimuli consisting of moving squares. Expectation violations were created in some trials by making 25% of the visual squares move in the opposite direction compared to the other 75%. The authors found that somatic and distal apical dendritic populations in layer 5 did not exhibit significantly different responses to expected versus unexpected visual flow, whereas both layer 2/3 somatic and distal apical dendritic populations showed a significant difference in responses. Additionally, this difference increased over days of exposure. Gillon et al. also found learning effects when mice were exposed to Gabor sequence stimuli for several consecutive days. The responses to unexpected stimuli (in this case, novel Gabor stimuli replacing an expected stimulus in a sequence) were predictive of how these responses evolve in subsequent sessions on a cell-by-cell basis. Besides implicating layer 2/3 neurons in prediction error computation, these results further confirm that the neural responses to unexpected stimuli (i.e., prediction errors) can drive learning in neural circuits, an important computational prediction of the predictive coding model (Rao & Ballard, 1999) (see Equation 4).

The greater prevalence of error-detecting neurons in superficial layers than in deep layers was also confirmed by Hamm et al. (2021) in awake mice with visual oddball paradigms. The authors additionally showed that optogenetic suppression of prefrontal inputs to V1 reduced the contextual selectivity of the error-detecting neurons, consistent with the effect of top-down signals in the predictive coding model. Finally, through laminar local field potential recordings in monkeys, Bastos et al. (2020) showed that predictability of visual stimuli affects neural activities in the superficial and deep layers differently: during predictable trials, there was an enhancement of alpha and beta power in the deep layers of the cortex, whereas during unpredictable trials, an increase in spiking and gamma power was observed in the superficial layers.

### Discussion

By casting Bayesian inference and learning in terms of minimizing prediction errors based on an internal model of the world, predictive coding provides a unifying view of perception and learning. Perception is equated with Bayesian inference of hidden states of the world and proceeds by forming predictive hypotheses about inputs that are corrected based on prediction errors. Learning corresponds to using the inferred states to build an internal model of the world that minimizes prediction errors through synaptic plasticity. Actions can further minimize prediction errors with respect to future goals via active inference.

The hierarchical predictive coding model (Rao & Ballard, 1999) assumes that the hierarchical structure of the cortex forms predictive hypotheses at multiple levels of abstraction to explain input data. The model postulates that feedback connections between cortical areas convey predictions of expected neural activity from higher to lower levels, while the feedforward connections convey the prediction errors back to the higher level to correct the neural activity at that level, characteristics that differentiate hierarchical predictive coding from other cortical models (Heeger, 2017; Lee & Mumford, 2003).

Early empirical support for the hierarchical predictive coding model was based on its ability to explain extra-classical receptive field effects such as endstopping and other contextual modulation of responses in the visual cortex in terms of prediction error minimization (Rao & Ballard, 1999). Rao and Ballard proposed that neurons in layer 2/3 exhibiting such effects can be interpreted as error-detecting neurons whose responses are suppressed when the properties of stimuli in the center of the receptive field can be predicted by stimuli in the surround, following natural image statistics. Several recent experimental studies have discovered neurons in the visual and auditory cortex that encode predictions or prediction errors in a variety of sensory-motor tasks (Fiser et al., 2016; Keller et al., 2012; Schneider et al., 2018). Some studies have tested more detailed neuroanatomical predictions such as the role of cortical layer 2/3 neurons in error computation (Jordan & Keller, 2020). Others have shown that these error-related neural activities can drive learning in synaptic connections (Gillon et al., 2021). Although further tests are required, the experimental results reviewed above support the hypothesis that the cortex implements a predictive model of the world, uses this model to generate predictions, and utilizes prediction errors both to correct its moment-to-moment estimates and to learn a better model of the world.

There remain many aspects of predictive coding that require further exploration and experimental corroboration. For example, are layer 5/6 neurons computing and maintaining the hidden state $r$ as specified by Equation 3? Are the inverse variances in Equation 3 (the “precision” terms in the free energy principle; see Friston, 2010) computed in the cortex? If so, how are they used to weight the bottom-up and top-down terms in the predictive coding network dynamics (Equation 3)? How is this “precision”-based weighting related to attention and robust predictive coding (Rao, 1998, 1999)? More broadly, can “what-where” predictive coding networks be made hierarchical and be used to understand visual processing in the ventral and dorsal streams of the visual cortex?
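As a hedged illustration of what precision-based weighting means, the sketch below uses a scalar toy problem rather than the full network dynamics of Equation 3. It assumes a single state `r`, a bottom-up observation `x` with variance `var_bu`, and a top-down prediction `r_td` with variance `var_td`. All function and parameter names are hypothetical. Under these assumptions, the error-driven dynamics settle at the precision-weighted average of the two sources.

```python
def precision_weighted_estimate(x, r_td, var_bu, var_td, lr=0.01, n_steps=5000):
    """Iterate scalar error-correcting dynamics with inverse-variance weights.

    Assumption: lr * (1/var_bu + 1/var_td) < 2, so the iteration converges.
    """
    r = 0.0
    for _ in range(n_steps):
        bottom_up = (x - r) / var_bu    # bottom-up error, weighted by its precision
        top_down = (r_td - r) / var_td  # top-down error, weighted by its precision
        r += lr * (bottom_up + top_down)
    return r


def closed_form_estimate(x, r_td, var_bu, var_td):
    """Fixed point of the dynamics: a precision-weighted average."""
    w_bu, w_td = 1.0 / var_bu, 1.0 / var_td
    return (w_bu * x + w_td * r_td) / (w_bu + w_td)


if __name__ == "__main__":
    # Reliable input (low var_bu): the estimate stays close to the observation.
    print(precision_weighted_estimate(x=1.0, r_td=0.0, var_bu=0.1, var_td=1.0))
    # Unreliable input (high var_bu): the estimate is pulled toward the prediction.
    print(precision_weighted_estimate(x=1.0, r_td=0.0, var_bu=1.0, var_td=0.1))
```

Increasing the precision of one source shifts the estimate toward that source, which is one way to read the proposed link between precision and attention. Whether and how the cortex computes such weights is the open question raised above.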

Spatiotemporal hierarchical predictive coding is another area worthy of further study. Palmer et al. (2015) derived a model by solving the information bottleneck problem (Tishby et al., 1999) and suggested that retinal ganglion cells may signal predictive information about the future states of the environment, a result recently confirmed by Liu et al. (2021). Rao (1999) presented a single-level Kalman filtering model for predicting inputs one time-step ahead based on learning linear transition dynamics from input sequences. These models, however, do not address hierarchical representation of temporal information. Experimental evidence suggests that cortical representations exhibit a hierarchy of timescales from lower-order to higher-order areas across both sensory and cognitive regions (Murray et al., 2014; Runyan et al., 2017; Siegle et al., 2021). Recent work by the authors (Jiang et al., 2021) suggests that a hierarchical predictive coding model based on dynamic synaptic connections (via “hypernetworks”) can learn visual cortical space-time receptive fields and hierarchical temporal representations from natural video sequences. Ongoing work is focused on exploring the connections between such learned temporal representations and response properties in different cortical areas.

The original predictive coding model of Rao and Ballard described how a hierarchical network can converge to maximum a posteriori estimates of hidden states at different hierarchical levels. Although the model included variances for the top-down and bottom-up errors, it did not explicitly represent uncertainty. The Kalman filter version of predictive coding (Rao, 1999) does represent uncertainty in terms of a Gaussian posterior distribution, but whether the cortex can compute covariance matrices (or just the diagonal variances) remains unclear. Other theories of how the brain may represent uncertainty and perform Bayesian inference using population coding and sampling (Echeveste et al., 2020; Huang & Rao, 2016; Ma et al., 2006; Orbán et al., 2016; Rao, 2004, 2005) are complementary to predictive coding and the connections between these theories remain to be worked out.
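To make concrete what representing uncertainty as a Gaussian posterior involves, the following sketch shows one step of a generic textbook linear-Gaussian Kalman filter. It is not the specific formulation of Rao (1999). The names `V` (state transition matrix), `U` (generative or observation matrix), `Q` and `R` (process and observation noise covariances) are assumptions chosen for illustration.

```python
import numpy as np


def kalman_step(mean, cov, obs, V, U, Q, R):
    """One predict-and-correct step of a standard linear-Gaussian Kalman filter."""
    # Predict the next state and its uncertainty from the transition dynamics.
    mean_pred = V @ mean
    cov_pred = V @ cov @ V.T + Q
    # Prediction error between the actual input and the predicted input.
    err = obs - U @ mean_pred
    # Gain: how strongly the error corrects the prediction, given both uncertainties.
    S = U @ cov_pred @ U.T + R
    K = cov_pred @ U.T @ np.linalg.inv(S)
    # Posterior mean and full posterior covariance matrix.
    mean_new = mean_pred + K @ err
    cov_new = (np.eye(len(mean)) - K @ U) @ cov_pred
    return mean_new, cov_new, err


if __name__ == "__main__":
    I2 = np.eye(2)
    mean, cov, err = kalman_step(
        mean=np.zeros(2), cov=I2, obs=np.array([2.0, 2.0]),
        V=I2, U=I2, Q=np.zeros((2, 2)), R=I2,
    )
    print(mean, cov, err)
```

The update requires a full covariance matrix and a matrix inverse at every step. Keeping only the diagonal variances would be a cheaper approximation, which is the distinction the paragraph above leaves open for the cortex.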

Finally, there is much to be explored in relating predictive coding to cognition, memory, and behavior. Several studies have shown that prediction errors (or “surprise”-related signals) can drive memory reactivation and reconsolidation (Bein et al., 2020; Kim et al., 2014; Rust & Palmer, 2021; Sinclair & Barense, 2019), suggesting a role for error signals in memory updating, but the connections to predictive coding theories remain unclear. Friston and colleagues have made important contributions in establishing some of these connections (Friston, 2010; Friston et al., 2017) through the free energy principle and active inference (see the section “Predictive Coding and the Free Energy Principle”). Empirical studies such as those reviewed above have demonstrated the close links between predictive coding and active behaviors such as locomotion. We expect future predictive coding theories to incorporate actions, attention, memory, and planning. Together with new tools such as Neuropixels probes (Jun et al., 2017; Steinmetz et al., 2021) for large-scale recordings and optogenetics for stimulation, predictive coding theories can enable new paradigms for theory-driven experimentation in neuroscience.

### Acknowledgment

This material is based upon work supported by the Defense Advanced Research Projects Agency (contract number HR001120C0021); the National Institute of Mental Health (grant number 5R01MH112166); the National Science Foundation (grant number EEC-1028725); and a grant from the Templeton World Charity Foundation. The opinions expressed in this publication are those of the authors and do not necessarily reflect the views of the funders. The authors would like to thank Ares Fisher, Dimitrios Gklezakos, and Samantha Sun for suggestions, discussions, and manuscript edits.