Printed from Oxford Research Encyclopedias, Psychology. Under the terms of the licence agreement, an individual user may print out a single article for personal use (for details see Privacy Policy and Legal Notice).

date: 25 July 2021

Face Perception

  • Andrew W. Young, University of York


Faces carry a wide variety of socially important information, including invariant personal properties that can be used to recognize someone’s identity and highly changeable characteristics that assist in interpreting their feelings and intentions. These relatively invariant and more changeable aspects of face perception present differing everyday demands that, together with genetic and cultural influences, help determine the organization of processes involved in human face perception.


  • Neuropsychology

Theoretical Approaches

Faces loom large in our everyday lives for many different reasons. We use them to infer people’s thoughts and feelings, to help in understanding what they are saying, to recognize people we know, and to form impressions of people we haven’t met before, including their age, gender, attractiveness, and even personality.

The sheer variety of these social purposes means that different aspects of face perception can be studied from a number of different angles, and this has contributed to the considerable amount of research carried out. But the variety of purposes also raises interesting questions itself. For example, we can ask whether some types of analysis are interdependent, such as whether we must first analyze a face’s gender in order to establish its identity (Bruce et al., 1987). More generally, we can ask how the system that underlies our ability to infer so many different things from faces is organized. In particular, does face perception involve distinct specialist components dedicated to specific types of analysis needed for different social purposes? Or is it instead relatively undifferentiated because all the social properties are derived from a common physical stimulus, the face (Bruce & Young, 1986, 2012)? In this way, studies of face perception can be used to address fundamental issues about the influences that drive the organization of our brains (Young, 2018).

Such questions can be asked at different levels of analysis that can be characterized as involving functional or neural perspectives. From a functional standpoint, the focus of interest is the organization of cognitive processes and components underlying face perception, whereas from the neural perspective the focus of interest is the brain regions and neural pathways involved in perceiving faces. Functional approaches are most directly relevant to psychology and are mainly used here, but neural studies are drawn upon when they are informative about psychological questions (Kanwisher & Barton, 2011). While there is still debate concerning the extent to which functional and neural organization will map onto each other (Coltheart, 2006; Henson, 2005), it is reasonable to begin by expecting (and being reassured by) some degree of correspondence (Henson, 2005). Indeed, convergence across different sources of evidence is an important criterion for considering that a theory may offer a useful characterization (Bruce & Young, 2012).

At present, the most widely used theoretical model that can integrate functional and neural perspectives is the one suggested by Haxby and colleagues (Haxby et al., 2000), shown in schematic form in figure 1. This model proposes a core system of regions involved in the visual processing of faces, with distinct neural pathways for processing relatively invariant facial properties (such as personal identity, via a pathway involving the inferior occipital gyri and lateral fusiform gyrus) and relatively changeable aspects of faces that vary from moment to moment (such as gaze direction, emotional expressions, and mouth movements, via a pathway involving the inferior occipital gyri and posterior superior temporal sulcus).

Figure 1. Model of the neural network underlying face perception, from Haxby et al. (2000). The model proposes a core system for visual analysis of faces and suggests other critical neural regions that form an extended system involved in further processing. The upper left panel shows an fMRI scan of the location of face-responsive regions that form the core system: the occipital face area (OFA) in the inferior occipital gyri, the fusiform face area (FFA) in the lateral fusiform gyrus, and the posterior superior temporal sulcus (STS).

Model reproduced from Haxby et al. (2000, p. 230, Figure 5) with permission from Elsevier. RightsLink license 4991950819017. Locations of face-responsive regions reproduced from Young (2018, p. 571, Figure 2) with permission from SAGE under STM Guidelines.

Haxby et al.’s model offers a synthesis of results from studies of neural responses using functional magnetic resonance imaging (fMRI) and older functional models based on behavioral and neuropsychological evidence (Bruce & Young, 1986). Most of the fMRI work used the subtractive method to identify face-responsive brain regions (Kanwisher, 2017), but later fMRI research using adaptation to properties of face stimuli, or multivoxel analysis of the patterns of activation, has continued to emphasize the importance of the same regions (Andrews & Ewbank, 2004; Flack et al., 2019; Kovacs, 2020), as have direct measurements from intracerebral recordings (Jonas et al., 2016). What remains less clear is exactly what each of these regions contributes to the processes involved in perceiving faces (Freiwald et al., 2016; Kovacs, 2020; Yovel, 2016). However, Haxby et al.’s model forms a useful point of reference from which to consider available evidence, although it has been suggested that it may need revision in places (Calder, 2011; Freiwald et al., 2016). The pathway involving the inferior occipital gyri and lateral fusiform gyrus clearly forms part of the widely described ventral visual stream, but the pathway involving the inferior occipital gyri and posterior superior temporal sulcus sits between Mishkin et al.’s (1983) ventral and dorsal streams and can instead be considered to form a pathway specialized for social perception (Pitcher & Ungerleider, 2021).

As well as asking about the organization of face perception itself, it is important to bear in mind that faces form part of a system of interpersonal communication that can also make use of social signals offered by our voices and bodies. The relation between these different domains is therefore also a question of interest and importance that is touched on in Haxby et al.’s concept of an “extended system” that links their core system for visual analysis of faces to other brain regions (see figure 1).

A transformative development has been the availability of image manipulation techniques based on computer graphics. One of the most widely used such programs is Psychomorph (Sutherland, 2015; Tiddemann et al., 2001). Psychomorph offers useful insights by allowing researchers to change the appearance of face images in systematic ways; for example, to create averages or caricatures. A more detailed description of these techniques and some of the ways they can be used is given by Sutherland et al. (2017a). For the most part, such methods involve establishing a large number of fiducial points that determine the face’s shape in an image (for example to delineate the positions of the mouth, eyes, chin, etc.)—as shown in figure 2—and then treating the brightness and hue values of the image pixels as surface textural properties that have been overlaid onto the underlying shape. By treating an image of a face as involving a combination of shape (feature positions) and surface (brightness and color values) properties, it becomes possible to investigate the consequences of changing each of these. Although shape and surface properties covary to some extent—for example, opening the mouth both changes the shape of the mouth and creates a new surface region corresponding to what is visible of the teeth and tongue—they can sometimes be manipulated relatively independently. The InterFace software (Kramer et al., 2016) adds principal components analysis (PCA) to this powerful technical armory. PCA is a data reduction technique that finds the simplest mathematical description of the underlying dimensions of variability (principal components, or PCs) across a large set of data. PCs are generated in order of the amount of variance they capture, with the aim that the original data (in this case, images of faces) can be described accurately in relatively few novel dimensions.
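As a rough illustration of what PCA does, the sketch below finds the first principal component of a tiny set of made-up "images", each flattened to four pixel brightness values. The data, the function names, and the choice of power iteration are assumptions made for illustration only; nothing here is taken from InterFace, and real analyses work on full-size images and extract many components.

```python
import math

# Made-up data: five "images", each flattened to four pixel brightness values.
images = [
    [10, 20, 30, 40],
    [12, 24, 30, 41],
    [14, 28, 31, 40],
    [16, 32, 30, 39],
    [18, 36, 29, 40],
]

def center(rows):
    """Subtract the mean image so components describe variation around it."""
    n = len(rows)
    mean = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, mean)] for row in rows], mean

def covariance(centered):
    """Sample covariance between every pair of pixels."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=100):
    """Power iteration: converges on the direction capturing the most variance."""
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, variance

centered, mean_image = center(images)
cov = covariance(centered)
pc, pc_variance = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = pc_variance / total_variance

# Each image can now be summarised by a single score on the first component.
scores = [sum(p * c for p, c in zip(pc, row)) for row in centered]
```

In this toy set the first two pixels vary together, so one component accounts for most of the variance and each image is described fairly well by a single score rather than four brightness values.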

Figure 2. A typical method of image manipulation, as used in Psychomorph (Sutherland, 2015; Tiddemann et al., 2001). Fiducial points are placed on the image to define its shape based on the locations of key features (left panel). These fiducials are used to tessellate the image into smaller deformable regions (right panel) on which surface properties of brightness and hue are overlaid. Different face images can then be brought to a common shape (in terms of the locations of fiducial points) before being combined, allowing averaging of different images without substantially blurring features and contours. The averaged images can then be reshaped to the average shape of constituent images, the individual shape of one of the constituent images, or any other desired shape.

Images courtesy of David Perrett. Reproduced from Bruce and Young (2012, p. 76, Figure 2.11) with permission from Informa UK Limited. PLS Clear license 46170.
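The shape side of averaging and caricaturing can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the three-point "faces", the function names, and the exaggeration factor `k` are all hypothetical, and the warping of surface texture onto the new shape, which real software also performs, is left out.

```python
# Hypothetical fiducial points (x, y) for three faces:
# left eye, right eye, mouth centre. Real point sets are far larger.
face_a = [(32, 40), (68, 40), (50, 82)]
face_b = [(28, 42), (72, 42), (50, 78)]
face_c = [(30, 38), (70, 38), (50, 80)]

def average_shape(shapes):
    """Mean position of each fiducial point across a set of faces."""
    n = len(shapes)
    return [
        (sum(s[i][0] for s in shapes) / n, sum(s[i][1] for s in shapes) / n)
        for i in range(len(shapes[0]))
    ]

def caricature(shape, norm, k):
    """Scale each point's deviation from the norm by k.

    k > 1 exaggerates the face's distinctive shape, k = 1 leaves it
    unchanged, and k = 0 returns the norm itself.
    """
    return [
        (nx + k * (x - nx), ny + k * (y - ny))
        for (x, y), (nx, ny) in zip(shape, norm)
    ]

mean_shape = average_shape([face_a, face_b, face_c])
exaggerated = caricature(face_a, mean_shape, 1.5)
```

Here `face_a` has eyes slightly closer together and a lower mouth than the average, so the caricature moves the eyes closer still and the mouth lower still.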

Programs like Psychomorph and InterFace manipulate two-dimensional (2D) images of faces, mainly face photographs, whereas the face has a complex three-dimensional (3D) structure. However, the use of 2D static images does not seem to present substantial limitations. Indeed, we see an abundance of face photos in our daily lives, and the images of real faces that fall on the retinas of our eyes are intrinsically 2D. While stereopsis may offer some information about the 3D shape of real faces, this will only be effective across a limited range of distances, and 2D images contain many other cues to 3D shape. Nonetheless, approaches based on 3D shape are also available (Vetter & Walker, 2011) and some are beginning to incorporate movement as well (Burt & Crewther, 2020).

Relatively Invariant Personal Characteristics

Haxby et al. (2000) used personal identity as a paradigmatic example of an invariant facial characteristic, but there are other relatively invariant properties, such as age, gender, and racial background, that are highly salient for human perceivers (Bruce & Young, 2012; Yan et al., 2017). These properties (and especially apparent age) are “relatively invariant” because our facial features do of course change across time, but slowly.

The cues that underlie apparent age are now reasonably well understood (Bruce & Young, 2012). Changes in shape reflect the growth of the underlying skull across childhood and early adulthood together with the loss of elasticity of muscles and skin in older adults and adjustments in subcutaneous fat accompanying weight gain or loss. Changes in surface texture can reflect “lifestyle” factors, which include exposure to sunlight and alcohol use, or general health. Interestingly, these influences work together (Burt & Perrett, 1995)—changing either shape or surface properties in an appropriate direction will make a face look older or younger. This observation has important theoretical and practical implications. In terms of theory, any search for a single “diagnostic” cue that determines apparent age will be fruitless, because there are so many interacting factors. In terms of practical implications, although it is now easily possible to make an image of a face look older, it is risky to use this to predict what a missing person might look like a number of years later, because the unknown influences of subsequent lifestyle on the appearance of a missing person can have such a strong effect.

Similar points apply to perception of gender. The faces of men and women differ in terms of their 3D shape (Bruce et al., 1993), as shown in figure 3, but also in terms of regional brightness (Russell, 2009), as shown in figure 4. Changing either the shape or the surface properties can make an image of a face look more masculine or more feminine in appearance, so again both shape and surface properties are involved.

Figure 3. Differences in 3D shape between male and female faces. The upper row shows average female (left) and male (right) 3D head shapes obtained by laser scanning (Bruce et al., 1993). These laser scans are purely volumetric representations devoid of surface texture; the shading in the images only represents the source of illumination. The lower row compares these average male and female 3D shapes. On the left are regions more prominent in female faces (female minus male) and on the right are regions more prominent in male faces (male minus female), with positive and negative differences plotted using the color scale shown beneath (increasingly negative differences in violet to increasingly positive in red).

Reproduced from Bruce and Young (2012, p. 107, Figure 3.6) with permission from Informa UK Limited. PLS Clear license 46170.

Figure 4. Differences in average surface properties between male and female faces. The shape of each image (positions of facial features) is identical, but the skin tone is made lighter to make the face look more feminine (left) or darker to look more masculine (right).

Reproduced from Russell (2009, p. 1215, Figure 3) with permission from SAGE under STM Guidelines.

Although race has been discredited as a biological entity (Rossion & Michel, 2011), it remains widely used as a perceptual category to encompass apparent phenotypical similarities among people with a common ethnic background. Again, perceived race involves a combination of shape differences across facial features and surface differences in skin hue and brightness (Bruce & Young, 2012; Rossion & Michel, 2011).

Therefore, perception of age, gender, and race involves multiple covarying cues. These cues are used to arrive at judgments that can often be fairly accurate and that show some evidence of automaticity (Yan et al., 2017). What is less straightforward, however, is to understand how such judgments relate to recognition of a face’s individual identity.

Figure 5 shows a sorting task introduced by Jenkins et al. (2011) that has proved instructive. A key feature of this task is that it involves multiple everyday photographs of faces instead of the more usual tactic adopted in studies that rely on a single photograph of each face taken under carefully standardized conditions; Jenkins et al. refer to these everyday photographs as “ambient images.” Participants are asked to sort the 40 ambient images into piles representing different face identities; they most frequently create around nine piles, whereas the correct solution is that there are actually only two faces (with 20 images of each). What happens is that participants tend to mistake differences between the images for differences in identity, leading them to overestimate the number of faces in the display. This is interesting because so much past research has assumed that we are “face experts” for perceiving identity, but the weak overall performance of most people sets clear constraints on this expertise (Young & Burton, 2017, 2018). Moreover, studies have also tended to assume that the main problem in face recognition is to tell the members of the superordinate perceptual category of faces apart, whereas Jenkins et al.’s data show that, for ambient images, the problem is just as much one of being able to see what is common across different exemplars that can vary in many ways, which include differences in viewpoint and lighting, but also in facial hair and hair styling, expression, and even age. Note, too, that this is a perceptual problem, not one of face memory; participants can look at and compare the photographs as much as they like.

Figure 5. Sorting task used by Jenkins et al. (2011). Participants are asked to sort the 40 images into the different face identities.

Reproduced from Jenkins et al. (2011, p. 316, Figure 2) with permission from Elsevier. RightsLink license 4991960269205.

Such findings are obtained when the faces are unfamiliar to participants. In marked contrast, the same task with familiar faces seems almost trivially easy. Somehow, our visual systems can easily deal with variability between ambient images of a familiar face, yet the same variability presents substantial problems when the face is someone we don’t know (Burton, 2013; Hancock et al., 2001). In other words, most of us can show a high degree of image-invariant recognition for familiar, but not for unfamiliar, faces. So how is recognition of a face’s identity achieved?

Many approaches to face recognition have been based on the idea of a critical role for second-order configurational processing (Carey & Diamond, 1977). The claim is that all faces share a common first-order configuration of eyes above nose above mouth, but the differences between the faces of different individuals lie in the precise positioning of these facial features (the second-order configuration). Although intuitively appealing, it is now clear that the idea that second-order configurational processing is key to the recognition of face identity cannot be correct (Burton et al., 2015). In practice, the positioning of facial features is less rigid than the theory supposes; widening your eyes, wrinkling your nose, or opening your mouth will all modify the second-order configuration. Moreover, camera properties also create substantial differences in exact feature locations across photographs that a familiar perceiver can effectively ignore (Noyes & Jenkins, 2017). Indeed, it turns out that really large changes in feature positioning that result from altering the aspect ratio to stretch or squeeze a photograph of a familiar face don’t make it unrecognizable (Baseler et al., 2016; Hole et al., 2002). Consistent with this, the surface texture pattern of a face turns out to be a more important determinant of apparent identity than the feature positions. Warping a set of photographs of familiar faces to all have the same shape (by standardizing the fiducial positions in each image to the average of the set) creates a set of images that differ mainly in surface texture, yet these images with the same shape and varying surface texture remain easily recognizable (Andrews et al., 2016). In contrast, averaging the image textures but retaining the distinctive shape (fiducial positions) of each individual dramatically affects recognition, indicating that shape has a more limited influence on perceived identity than surface properties, as shown in figure 6.

Figure 6. Contributions of shape and surface information to face identity. An average image of the face of each of eight individuals is shown along the diagonal from top left to bottom right of the figure. The other images are hybrids that combine the 2D shape (image fiducials) from one identity with the surface from another identity. Images in each row have the same surface information, and images in each column have the same shape. For example, the bottom left image combines the averaged shape of Alan Sugar’s face with the averaged surface from Louis Walsh. Perceptually, the identities seem to be relatively preserved along each row in the display and strikingly disrupted in each column, showing a relative dominance of surface over shape information for face identity.

Reproduced from Andrews et al. (2016, Figure 1, p. 283) with permission from Elsevier. RightsLink license 5015830380544.

So, at best, the second-order configuration can play a limited role. A better candidate for the mechanism that underlies recognition is what is now called holistic processing—the idea that all relevant face properties are simultaneously apprehended, so that the face is seen as a whole. This was initially thought to be closely related to the second-order configurational account, but the differences have become clear over the years (Maurer et al., 2002; Tanaka & Gordon, 2011).

Holistic processing can be demonstrated in various ways, including the part–whole effect (Tanaka & Gordon, 2011) and the composite face effect (Murphy et al., 2016; Rossion, 2013). The composite effect has been particularly widely used, though there has been debate about the best way to measure it (Richler & Gauthier, 2014; Rossion, 2013). In essence, the composite effect involves demonstrating that changing part of a face alters the overall appearance of the whole face. For example, combining the top and bottom parts of different face photographs into a face-like composite seems to create the perception of a new whole face that will then interfere with tasks that require responding to its constituent top or bottom parts. While this is an important observation, holistic processing seems fundamental to almost all aspects of face perception, including gender, age, race, unfamiliar face identity, familiar face identity, gaze, facial expression, and trait inferences (Murphy et al., 2016; Rossion, 2013). Therefore, although it may well be involved in recognizing identity, holistic processing doesn’t itself offer an explanation for why familiar face recognition shows such remarkable image invariance. Nevertheless, holistic processing of identity has been shown to arise in Haxby et al.’s core system (Andrews et al., 2010) and possibly beyond the face-responsive network (Foster et al., 2021).

A more direct approach to image-invariance in familiar face recognition involves the observation that averages created from multiple images of the same familiar face (such as those shown along the diagonal running from top left to bottom right in figure 6) are themselves highly recognizable and seem in some ways to capture the essence of that person’s identity (Burton et al., 2005); this implies that the variability in appearance of an individual face across different images involves changes around a relatively stable central tendency. In an important extension of this point, Burton et al. (2016) went on to show that the variability in appearance is nonetheless identity-specific. This demonstration was achieved by using PCA (see the section “Theoretical Approaches” for a short explanation of PCA) to analyze the dimensions underlying the variability of images of different people’s faces. In Burton et al.’s study, the input data involved reshaping a number of ambient images of the same face to a common set of fiducial positions (so that features like the eyes and mouth are in the same positions in each image, to allow meaningful analysis of surface variation) and then finding the PCs for the brightness values of the image pixels. These surface-brightness PCs turn out to be to some extent idiosyncratic; that is, the way in which one person’s face varies in appearance across different images need not be exactly the same as the ways in which someone else’s face varies. Note that usually PCA is applied to sets of images containing multiple identities, a procedure that will tend to emphasize the PCs that are common to different faces. In contrast, Burton et al. found evidence of a degree of idiosyncratic variability by analyzing multiple images of a single individual.

A major implication of Burton et al.’s (2016) finding of partially idiosyncratic variability is that we need to learn the specific nature of the variability in appearance of each of the faces we can recognize; to some extent, each face has to be learned individually. This offers a potential explanation of why performance in recognizing unfamiliar faces can often be poor; in effect, the idiosyncratic aspects of the variability of an unfamiliar face are unknown to the perceiver (Young & Burton, 2018). Based on this observation, Kramer et al. (2017a, 2018) showed that a simple top-down clustering mechanism involving linear discriminant analysis (LDA) can be applied to PCs across a set of ambient face images in order to reshape the PCA space in a way that can bring images of particular trained faces closer together and thus allow recognition of previously unencountered exemplars of these trained identities. This procedure leads to a model that shows a number of hallmarks of human performance and also has interesting emergent properties, such as ability to discriminate any face image by gender or race without explicit training of these categories (Kramer et al., 2017a). These findings were achieved with a purely linear analysis of image properties assisted by a top-down clustering algorithm for the face identities to be learned; the involvement of conceptual information in the form of things we learn about familiar people may well facilitate the top-down aspect of this process in everyday life (Schwartz & Yovel, 2016, 2019).
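The general idea of using identity labels to reshape a PCA space can be illustrated with a two-class Fisher discriminant. This is only a sketch under strong assumptions: the PC scores are invented, there are just two identities and two components, and the code is not Kramer et al.'s implementation. In the made-up data, PC1 carries large identity-irrelevant variation (such as lighting) while PC2 separates the two identities only weakly.

```python
# Made-up scores (PC1, PC2) for four images of each of two identities.
identity_a = [(-4, 1.2), (-2, 0.8), (2, 1.2), (4, 0.8)]
identity_b = [(-4, -0.8), (-2, -1.2), (2, -0.8), (4, -1.2)]

def mean2(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter2(points, m):
    """Entries of the 2x2 within-class scatter matrix."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy

def fisher_direction(class_a, class_b):
    """Direction w = Sw^-1 (mean_a - mean_b) that best separates the classes."""
    ma, mb = mean2(class_a), mean2(class_b)
    axx, axy, ayy = scatter2(class_a, ma)
    bxx, bxy, byy = scatter2(class_b, mb)
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy
    det = sxx * syy - sxy * sxy
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    wx = (syy * dx - sxy * dy) / det    # first row of the 2x2 inverse
    wy = (-sxy * dx + sxx * dy) / det   # second row of the 2x2 inverse
    return (wx, wy), ma, mb

def classify(point, w, ma, mb):
    """Assign a new image to the identity whose projected mean is nearer."""
    def project(p):
        return w[0] * p[0] + w[1] * p[1]
    threshold = (project(ma) + project(mb)) / 2
    return "A" if project(point) > threshold else "B"

w, mean_a, mean_b = fisher_direction(identity_a, identity_b)
new_image_label = classify((3.5, 0.9), w, mean_a, mean_b)
```

The learned direction weights PC2 far more heavily than PC1, so images of the same identity land close together after projection even though they are widely scattered along PC1.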

Other recent approaches to understanding face identity recognition (Blauch et al., 2020; O’Toole et al., 2018) have involved deep convolutional neural networks (DCNNs). DCNNs have also proved effective and have some advantages, but their complexity makes it relatively difficult to understand precisely which aspects of their structure are critical to their success.

Of course, recognizing a visual stimulus as a familiar face is only part of what needs to be achieved when recognizing a familiar person. The perceiver also needs to be able to bring to mind pertinent identity-specific and emotional information that can facilitate interpretation of the familiar person’s behavior and guide any personal interaction (Bruce & Young, 1986; Burton et al., 1990; Wiese et al., 2019). This complex process is clearly context-dependent; you might be talking to your friend about something that happened at work one minute and about their children the next. It can go wrong in ways that can prove informative about how our knowledge of other people is organized (Barton & Corrow, 2016; Ellis & Lewis, 2001; Ellis & Young, 1990; Young & Burton, 1999; Young et al., 1985).

Communicative Signals

The communicative signals given by our faces differ from the relatively invariant personal characteristics in that they can change remarkably quickly—often from one moment to the next.

Gaze direction communicates a person’s direction of attention and can be linked to a variety of other mental states (Tipper & Bayliss, 2011) as well as serving social functions, such as signaling turn-taking in a conversation (Kleinke, 1986). A clever demonstration from the 19th century by Wollaston (1824) revealed a form of holistic processing of gaze direction in which information about eye gaze is combined with information about the orientation of the head to interpret gaze direction, as shown in figure 7. Modern studies with photorealistic stimuli confirm this conclusion (Langton, 2000; Langton et al., 2004), and adaptation paradigms have been used further to probe how eye gaze direction is coded (Calder et al., 2007, 2008).

Figure 7. Wollaston’s (1824) gaze direction illusion. The faces in the two drawings appear to look in different directions; the left face seems to be looking to the viewer’s right, whereas the right face seems to be gazing directly at the viewer. In each pair, however, the eyes are in fact identical, but shown in different face contexts. Our perception of gaze direction is based on combining information from the orientation of eyes and head.

Reproduced from Wollaston (1824, p. 256, Plate IX). Images are out of copyright.

However, the meaning of gaze can often be ambiguous and dependent on context for its correct interpretation. For example, prolonged eye contact can be a signal of threat or of sexual attraction. The most compelling results from brain imaging studies are therefore obtained in paradigms that link gaze to an interpretable context (Pelphrey & Van der Wyk, 2011), and these clearly show a role for posterior superior temporal sulcus (STS), consistent with Haxby et al.’s (2000) neural model.

Facial expressions signal emotional states and are also involved in a range of conventional gestures. Most studies have concentrated on a small number of expressions thought to be linked to relatively basic emotions with a distinct evolutionary background that is evident from comparisons with other species (Darwin, 1872; Ekman, 1992) and from the anatomy of the facial muscles themselves (Waller et al., 2008). Although these basic emotions are to some degree recognizable across very different cultures (Ambady & Weisbuch, 2011), there is also evidence of some degree of cultural variability that can perhaps be seen as analogous to differences in regional accent within a common language (Yan et al., 2016a).

Like gaze, even expressions of basic emotions are in part inherently ambiguous, so that context and other available cues (such as tone of voice or body posture) will influence their interpretation (Ambady & Weisbuch, 2011; de Gelder & Van den Stock, 2011; Kreifelts & Ethofer, 2018); they clearly form part of a very flexible communication system (Barrett et al., 2019).

While some facial features will obviously play a more important role in some emotions—such as a smiling mouth as a signal of happiness—holistic processing is nonetheless also evident in facial expression perception (Calder et al., 2000). There is evidence, too, that some emotions differentially engage certain brain regions that are themselves critical to triggering certain emotional responses; the most extensively studied example has been the role of the amygdala in fear (Feinstein et al., 2011). At the same time, it is clear that an understanding of emotional experience will involve multiple brain regions (Satpute & Lindquist, 2019).

Image manipulation techniques show that both shape and surface properties can play a role in recognizing facial expressions (see figure 8 for an example of this type of work) and image statistics derived from shape and surface properties correlate both with perceived similarity of expressions and neural responses from Haxby et al.’s core system (Sormaz et al., 2016a, 2016b). The role of shape was to be expected from the commonplace observation that facial expressions necessarily involve changes in feature shapes—such as the upturned corners of the mouth in a smile—but it forms a marked contrast with the more limited role of shape in recognition of face identity.

Figure 8. Contributions of shape and surface information to facial expression. The upper panel shows images created by combining shape and surface properties from averaged facial expressions of five basic emotions. Images in rows have the average surface features of each expression and images in columns have the average shape of each expression. Hence images along the top left to bottom right diagonal have the averaged shape and surface properties of the same facial expression. All other images represent hybrid combinations of shapes and surfaces from different expressions. For example, the bottom left image combines an averaged fear shape with an averaged happy surface. The lower panel shows behavioral responses when participants were asked to categorize each hybrid image’s expression as one of the five basic emotions. Percentages indicate whether the categorized expression corresponded to the shape or surface properties of the image, or when the response did not correspond to either the shape or the surface information in the image (neither). Responses based on shape and on surface were higher than responses involving neither shape nor surface, showing that both properties can be used to convey the emotional meaning to some degree. The difference between the responses based on shape and surface themselves did not reach statistical significance.

Reproduced from Sormaz et al. (2016a, Figures 5 and 6, pp. 7 and 8) with permission from Elsevier. RightsLink license 4991961483511.

The role of movement in recognizing facial expressions also needs to be considered. Most studies (cf. figure 8) rely on static images that represent the apex of the muscle movements thought to underlie an emotional expression. The fact that such images can be recognized with good levels of accuracy shows that this approach can be effective (Ekman, 1992); movement does not seem to be essential. The minority of studies that have investigated patterns of movement show mainly that these can help disambiguate some expressions that remain partly confusable in static images (Jack et al., 2014).

Where facial movement is clearly very important is in interpreting speech. Somewhat surprisingly, most of us make use of patterns of movement of the lips, tongue, and teeth as an aid to speech perception. The classic study was by Miller and Nicely (1955), who noted a substantial improvement for speech perception in background noise when the speaker’s face was visible. In considering the cause of this effect, they noted that “The place of articulation, which was hardest to hear in our tests, is the easiest of features to see on a talker's lips. The other features are hard to see, but easy to hear” (Miller & Nicely, 1955, p. 352). In this way it seems that because speech signals involve rapid temporal changes that have to be decoded as they occur, integrating complementary information from face and voice offers an optimal way of dealing with these temporal constraints. Studies of infants suggest that sensitivity to these audiovisual correspondences begins early in life (Mercure & Kischkel, 2018).

A direct demonstration of perceptual integration of auditory and visual information through “lipreading” is the McGurk illusion (Tiippana, 2014), in which a video showing the face of a person saying one phoneme (for example, “ga”) is combined with a different phoneme (for example, “ba”) on the soundtrack. Remarkably, the heard phoneme can then correspond to neither the auditory nor the visual part of the video; it is usually a fusion of the two (heard as “da” in the example given). The McGurk illusion again shows that, in hearing what someone says, we can make use of the correspondence between movements of their lips (and tongue) and the speech sounds. Functional brain imaging studies show that an important region for audiovisual integration from talking faces is located in the vicinity of left posterior STS (Calvert, 2001), and this has been confirmed by demonstrating that transcranial magnetic stimulation (TMS) to this region disrupts the McGurk effect (Beauchamp et al., 2010).

Trait Impressions

The evidence reviewed so far largely bears out the importance of Haxby et al.’s (2000) distinction between changeable and relatively invariant aspects of face perception and shows differences in how these are perceptually processed. But trait impressions form an important source of perceptual inferences that are based on a combination of both changeable and invariant properties.

The term “trait impressions” refers to our tendency to infer things about unfamiliar people—especially when we encounter them for the first time. Popular magazines often have articles about “face reading,” and there have been many historic theories concerning how our faces may betray our character (Bruce & Young, 1998). Although modern research suggests there is only at best a small “kernel of truth” behind such inferences (Todorov, 2017; Todorov et al., 2015), it seems that we all make them and that they have real-life consequences that are hard to avoid (Jaeger et al., 2020). The importance of understanding how such inferences are made has been enhanced by the prevalence of images of faces on the Internet. Indeed, we are capable of arriving at remarkably fast, snap decisions from even a brief glimpse of a face photograph (South Palomares & Young, 2018; Willis & Todorov, 2006).

A striking finding about trait inferences is their partly consensual nature. Although there is not much evidence for the validity of facial impressions (Todorov, 2017; Todorov et al., 2015), there is substantial agreement between observers from the same cultural background about who looks “shifty,” “charming,” “trustworthy,” “aggressive,” and so on. It is this consensus that underpins much of the revived interest in trait impressions. Moreover, the variety of inferences seems almost endless. How can we possibly agree with each other about our impressions of such a wide range of potential traits?

The key to understanding this agreement lies in the fact that impressions involving different traits will correlate to different degrees (Todorov & Oh, 2020); for example, perceived trustworthiness and approachability are highly correlated, whereas perceived trustworthiness and threat are less strongly correlated. Groundbreaking work by Todorov and his colleagues (Oosterhof & Todorov, 2008; Todorov et al., 2008) made use of this correlational structure. They began by showing images of faces to a number of observers and asking them to describe them. From these descriptions they took the most commonly mentioned traits and added the trait of “dominance” because of its role in contemporary theories. They then had the faces rated on all of these traits and used PCA to reveal the underlying dimensional structure.

Figure 9. Two-dimensional trait space for facial impressions, from Todorov et al. (2008). The images are based on the first and second principal components (PCs) resulting from a principal component analysis (PCA) of ratings of multiple traits from 66 face photographs (a) and 300 computer-generated synthetic face images (b). The diagrams show how the PCs align with some representative traits (not all traits included in the PCA are shown) and present images corresponding to some of the locations within the space.

Reproduced from Todorov et al. (2008, p. 457, Figure 1) with permission from Elsevier. RightsLink license 4991951297222.

Figure 9 shows where some of the traits from Todorov’s studies fall within a two-dimensional space whose axes are the first and second PCs. The first PC clearly aligned closely with perceived trustworthiness and this accounted for more than 60% of the variance in impressions. The second PC explained less variance (around 18%) and roughly approximated perceived dominance. All of the remaining traits could then be specified in terms of where they fell in this two-dimensional space; a couple of examples are shown in figure 9.
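The logic of deriving a low-dimensional trait space from correlated ratings can be illustrated with a minimal sketch. The Python code below is not the procedure used in Todorov's studies; it assumes a small synthetic data set in which two latent dimensions drive four trait ratings, and it extracts the first two PCs of the trait correlation matrix by power iteration with deflation. The trait labels, weights, noise level, and function names (`correlation_matrix`, `leading_component`, `first_two_components`, `simulate_ratings`) are all illustrative assumptions.

```python
import math
import random


def correlation_matrix(ratings):
    """Correlate trait columns; ratings holds one row of trait ratings per face."""
    n, k = len(ratings), len(ratings[0])
    means = [sum(row[j] for row in ratings) / n for j in range(k)]
    sds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in ratings) / (n - 1))
        for j in range(k)
    ]
    z = [[(row[j] - means[j]) / sds[j] for j in range(k)] for row in ratings]
    return [
        [sum(z[i][a] * z[i][b] for i in range(n)) / (n - 1) for b in range(k)]
        for a in range(k)
    ]


def leading_component(matrix, iterations=500, seed=0):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric matrix."""
    rng = random.Random(seed)
    k = len(matrix)
    vec = [rng.uniform(0.5, 1.5) for _ in range(k)]
    for _ in range(iterations):
        new = [sum(matrix[a][b] * vec[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    product = [sum(matrix[a][b] * vec[b] for b in range(k)) for a in range(k)]
    eigenvalue = sum(v * p for v, p in zip(vec, product))  # Rayleigh quotient
    return eigenvalue, vec


def first_two_components(ratings):
    """Return [(proportion_of_variance, trait_weights), ...] for PC1 and PC2."""
    corr = correlation_matrix(ratings)
    k = len(corr)
    components = []
    for _ in range(2):
        eigenvalue, vec = leading_component(corr)
        # The trace of a correlation matrix equals the number of traits (k).
        components.append((eigenvalue / k, vec))
        # Deflate: remove the component just found before extracting the next.
        corr = [
            [corr[a][b] - eigenvalue * vec[a] * vec[b] for b in range(k)]
            for a in range(k)
        ]
    return components


def simulate_ratings(n_faces=200, seed=1, noise_sd=0.4):
    """Toy data (an assumption): two latent dimensions drive four trait ratings."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_faces):
        valence, power = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append([
            valence + rng.gauss(0, noise_sd),                       # "trustworthy"
            0.9 * valence + rng.gauss(0, noise_sd),                 # "caring"
            power + rng.gauss(0, noise_sd),                         # "dominant"
            -0.6 * valence + 0.6 * power + rng.gauss(0, noise_sd),  # "aggressive"
        ])
    return rows


if __name__ == "__main__":
    traits = ["trustworthy", "caring", "dominant", "aggressive"]
    results = first_two_components(simulate_ratings())
    for index, (share, weights) in enumerate(results, 1):
        print(f"PC{index}: {share:.0%} of variance")
        for trait, weight in zip(traits, weights):
            print(f"  {trait:12s} {weight:+.2f}")
```

In this toy example each trait can be located in the two-dimensional space by its weights on the two components, which is the sense in which the remaining traits are "specified" by their position relative to the first two PCs. The sign of each component is arbitrary, as in any PCA.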

A particularly important thing about Todorov’s approach is that it was largely data-driven; most of the traits studied were selected based on participants’ descriptions of faces rather than attempting to impose a predetermined framework. Data-driven approaches have become widespread and have transformed our understanding of social perception (Adolphs et al., 2016; Cowen & Keltner, 2021) because they can encompass the richness of complex data with minimal assumptions, but the outcomes are still to some extent determined by the sample of images used. Later work by Sutherland et al. (2013) used a larger sample of 1,000 ambient images that spanned a wider range of ages and found the three-factor structure shown in figure 10. Two of these factors (approachability and dominance) clearly approximate Todorov’s first and second PCs, but the youthful-attractiveness factor is novel; in figure 9, attractiveness simply falls within the 2D space because Todorov’s stimuli involved a more restricted range of ages. Similarly, a study of impressions of children’s faces has shown a somewhat different factor structure (Collova et al., 2019).

Figure 10. Visualization of Sutherland et al.’s (2013) three-factor structure of facial trait impressions. Averages created from the 20 images loading lowest (left column) or highest (right column) on factors of approachability (top row), youthful-attractiveness (middle row), and dominance (bottom row) in Sutherland et al.’s (2013) study.

Reproduced from Sutherland et al. (2013, p. 113, Figure 2A) with permission from Elsevier. RightsLink license 4991960470015.

An informative part of a data-driven approach can also be to work with highly variable images. Studies using different images of the same face show that trait impressions are as much impressions based on properties of specific images as of the faces themselves; most people realize intuitively that their best photo to use on a dating website may not be the best one to choose for their CV. Different images of the same face can create very different impressions (Jenkins et al., 2011; Sutherland et al., 2017b; Todorov & Porter, 2014), as shown in figure 11 using data from Mileva et al. (2019).

Figure 11. Variability in impressions across different images of the same face. The data plots show mean ratings of sets of 20 everyday ambient images of each of 10 unfamiliar male and 10 unfamiliar female faces for trustworthiness (top), attractiveness (middle), and dominance (bottom), displayed separately for male (left) and female (right) face identities. Each column represents a single face identity and each point represents a single image. Identities are ranked on the horizontal axis by mean overall score, separately for each rating. There are substantial differences in the ratings given to different images of the same face for all three judgments.

Reproduced from Mileva et al. (2019, p. 188, Figure 2) with permission from Elsevier. RightsLink license 4991960626455.

Computational approaches have shown that a substantial proportion of the variability in impressions can be captured directly from image properties (Mileva et al., 2019; Vernon et al., 2014). These studies have shown that multiple cues create each type of impression and that linear techniques that are able to exploit this cue covariation can be quite effective in modeling impressions and predicting how a particular image will be evaluated. However, it is also evident that higher-level influences, such as gender stereotyping, operate to some degree (Oh et al., 2019; Sutherland et al., 2015). For example, the representations of Sutherland et al.’s three-factor structure shown in figure 10 that were created by averaging ambient images with the highest and lowest loadings on each factor clearly conform to the stereotypes that women will be more approachable and less dominant than men.
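The idea of a linear model that combines multiple image cues to predict an impression can be sketched as follows. This is a generic ordinary least squares fit, offered as an illustration under stated assumptions and not as the modeling approach of the studies cited above; the cue values are hypothetical measurements (for example, mouth curvature and eyebrow height), and the function names `solve`, `fit_linear_model`, and `predict` are invented for this example.

```python
def solve(matrix, vector):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - known) / aug[i][i]
    return solution


def fit_linear_model(cues, ratings):
    """Least squares fit of rating = intercept + sum(weight_j * cue_j).

    cues: one tuple of cue measurements per image (hypothetical cues).
    ratings: one mean impression rating per image.
    Returns [intercept, weight_1, weight_2, ...].
    """
    rows = [[1.0] + list(image_cues) for image_cues in cues]
    p = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * y for r, y in zip(rows, ratings)) for a in range(p)]
    return solve(xtx, xty)


def predict(weights, image_cues):
    """Predicted rating for a new image from its cue measurements."""
    return weights[0] + sum(w * c for w, c in zip(weights[1:], image_cues))


if __name__ == "__main__":
    # Hypothetical cue measurements for five images and their mean ratings.
    cues = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
    ratings = [1.0, 3.1, 0.4, 2.6, 4.4]
    weights = fit_linear_model(cues, ratings)
    print("intercept and cue weights:", [round(w, 2) for w in weights])
    print("predicted rating for a new image:", round(predict(weights, (1.5, 0.5)), 2))
```

A model of this kind captures only the additive contribution of the measured cues; the higher-level influences noted above, such as stereotyping, would appear as systematic residuals unless they are modeled separately.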

The studies of trait impressions described so far are underpinned by average consensual judgments across observers, but of course this agreement is actually less than perfect; average judgments minimize the impact of any differences between different observers. Recent work has therefore begun to explore how much of our impressions involve consensual “shared taste” and how much they are attributable to idiosyncratic “private taste” (Hehman et al., 2017; Sutherland et al., 2019).
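One simple way to see what the distinction between shared and private taste means in practice is to partition the variance in a complete observers-by-images rating matrix. The sketch below is a deliberate simplification and not the multilevel modeling used in the cited studies: it treats variance between image means as "shared taste" and the remaining observer-by-image variance as "private taste", which with only one rating per observer per image cannot be separated from measurement noise. Overall differences in how generously each observer rates are set aside. The function name `taste_components` and the example ratings are assumptions made for illustration.

```python
def taste_components(ratings):
    """Split rating variance into shared and private (plus noise) proportions.

    ratings[o][i] is the rating given by observer o to image i; the matrix
    must be complete (every observer rates every image).
    """
    n_obs, n_img = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n_obs * n_img)
    obs_mean = [sum(row) / n_img for row in ratings]
    img_mean = [sum(ratings[o][i] for o in range(n_obs)) / n_obs for i in range(n_img)]

    # Shared taste: how much image averages differ from one another.
    ss_shared = n_obs * sum((m - grand) ** 2 for m in img_mean)
    # Private taste (plus noise): what is left after removing observer and image means.
    ss_private = sum(
        (ratings[o][i] - obs_mean[o] - img_mean[i] + grand) ** 2
        for o in range(n_obs)
        for i in range(n_img)
    )
    total = ss_shared + ss_private
    if total == 0:
        raise ValueError("ratings show no variation across images")
    return {"shared": ss_shared / total, "private_or_noise": ss_private / total}


if __name__ == "__main__":
    # Three hypothetical observers rating four images on a 1-7 scale.
    example = [
        [6, 5, 2, 3],
        [5, 5, 3, 2],
        [7, 3, 2, 5],
    ]
    print(taste_components(example))
```

When every observer ranks the images identically the shared proportion is 1, and when observers' preferences cancel out so that all image averages are equal it is 0. Separating private taste from noise requires repeated ratings of the same images by the same observers.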

An obvious potential source of differences in trait impressions is the observer’s cultural background, but data-driven studies are finding that, as for facial expression recognition, cross-cultural differences seem to arise against a substantial degree of underlying agreement (Sutherland et al., 2018; Todorov & Oh, 2020). This evidence of substantial cross-cultural agreement probably results from multiple causes. One likely source of impressions lies in overgeneralization from physical cues; for example, thinking that a smile signals trustworthiness (Montepare & Dobish, 2003) or that a “baby-face” appearance signals immaturity (Zebrowitz, 2017). These overgeneralizations themselves relate to wider issues of stereotyping, such as assuming that a person with an attractive face will have many other positive qualities (Dion et al., 1972). From a broader perspective, it seems plausible that the dimensions of impressions relate to fundamental mechanisms of appraisal that are common across primate species (Todorov, 2017)—Is this person disposed to help or harm me (trustworthiness/approachability)? How capable are they of carrying out their intentions (dominance)? Might they be a potential mate (youthful attractiveness)? What seems to happen is that impressions based on momentary dispositions may have some validity, but then become overgeneralized into assumed stable personality traits in the absence of other information about an unfamiliar person (Todorov, 2017; Young, 2018) in a way that is reminiscent of the fundamental attribution error in social psychology (Ross, 2018).

Individual Differences

Although there has been significant progress in understanding face perception, many issues remain unresolved. One that is of both theoretical and practical importance involves the causes of individual differences (Wilmer, 2017). Many studies of face recognition have assumed that across development humans become “face experts” capable of uniformly high levels of performance (Scott, 2011). From this standpoint, mistakes made by eyewitnesses were puzzling, and a lot of attention was given to extrinsic sources of error introduced by emotional involvement, leading questions, and the like (Lindsay et al., 2011). However, more recent studies of individual differences show a huge range of ability to recognize the identities of unfamiliar faces, ranging from very good performance by “super-recognizers” (Russell et al., 2009), to very poor performance by individuals often labeled “developmentally prosopagnosic” (Duchaine, 2011), and all shades between these extremes (Wilmer, 2017; Young & Burton, 2017, 2018).

The differences in performance are often found in face-matching tasks that have no memory component; they simply involve deciding whether different photos show the same person (for example, figure 5). Face matching is of special interest because it offers an analog of the task faced by passport officers (Does this passport photo really show the person standing before me?) or by anyone checking photo ID in everyday life. It turns out that the performance of passport officers is as variable as that of the rest of the population, despite their years of experience and training (White et al., 2014). Some people, such as professional forensic examiners, can be trained to become better at unfamiliar face matching, but this training involves learning techniques that seem quite different from everyday recognition (Hu et al., 2017).

Studies of facial expression recognition also show clear individual differences (Connolly et al., 2019), raising interesting questions about how they relate to differences in perceiving and recognizing face identity. There is evidence of a general factor that underlies performance across a range of face-perception tasks (Verhallen et al., 2017), but the shared variance is seldom more than 20%, leaving much remaining variability to be accounted for. Structural equation modeling and related techniques are proving useful here; for example, by estimating how facial expression recognition relates to intelligence and to ability to recognize emotions more generally (Connolly et al., 2020).

This variability of performance across individuals places constraints on the applicability of the widely used concept of face expertise. The criteria for expert ability need to be carefully considered and, instead of assuming acquired expertise for all aspects of face perception, theorizing needs to take seriously the possibility that human observers are expert for some aspects of face perception (such as recognition of familiar face identity) and not others (such as recognition of unfamiliar face identity; Young & Burton, 2018). A further constraint on thinking solely in terms of acquired expertise with faces comes from accumulating evidence of genetic influences on face-recognition ability (Shakeshaft & Plomin, 2015; Wilmer et al., 2010; Zhu et al., 2010).

Outstanding Theoretical Issues

Returning to the central issue of whether face perception involves distinct specialist components or instead relies on relatively undifferentiated processing, it is clear that the evidence at present falls in favor of functional specialization. However, it is also apparent that there is substantial cross-talk within the system and that the segregation between different processes is likely less than complete (Young, 2018). For example, PCA shows that some PCs are useful in analyzing both identity and expression (Calder et al., 2001) and adaptation studies show interactions between identity and expression (Rhodes et al., 2015). Although a more detailed understanding is needed, then, the key theoretical question involves what determines this organization.

The evolved structure of the brain offers an obvious starting point, but it has proved controversial. While genetic influences on face perception have been demonstrated, there is debate about whether they reflect something face-specific or more general mechanisms (Gauthier et al., 2000; Kanwisher, 2000; McKone & Robbins, 2011). That said, it does seem that newborn infants are naturally attentive to faces (Atkinson, 2017; Maurer & Mondloch, 2011) and that category-selective responses to faces are present in the infant brain (Deen et al., 2017). Many researchers also take the evidence of significant cross-cultural agreement in the interpretation of emotional facial expressions and trait impressions as further evidence of potential evolutionary influences (Ekman, 1992).

What keeps this debate alive is that evidence of evolved brain structure for face perception is indirect and can often be interpreted in other ways. In contrast, evidence that our brains are to some degree modifiable through experience is abundant, including observations of how as infants our perceptual abilities become tuned to the type of faces we see (Lee et al., 2011) and how as adults we show other-race effects in perceiving unfamiliar face identity and facial expressions (Rossion & Michel, 2011; Yan et al., 2016b). Indeed, our capacity for learning is so great that the number of familiar faces we can nowadays recognize vastly outstrips the number that would have been encountered across the period when any face-specific mechanism might have evolved (Jenkins et al., 2018).

Like so many nature or nurture debates, this one looks likely to be settled by accepting that there will be a complex mix of both sources of influence (Honeycutt, 2019; Leopold & Rhodes, 2010). Given the likely importance of sociality in human evolution (Dunbar, 2016), it is thought that social factors, including the demands of communication and identity recognition, have contributed to the structure of human heads and faces (Lacruz et al., 2019; Sheehan & Nachman, 2014). This makes it plausible that our brains have also adapted in some way to facilitate the task of perceiving faces. Notably, the effects of experience don’t seem to extend so far as to create a different structure in different individuals’ brains; fMRI studies show remarkable consistencies in the locations and time-courses of neural responses to faces (Hasson et al., 2004).

An often-neglected potential determinant of functional organization is the demands of everyday life (Hasson et al., 2019; Young, 2018; Young et al., 2020). While the implications of image variability are of critical importance in understanding face perception, the consequences of variability are themselves fundamentally different for changeable and relatively invariant characteristics. For changeable signals, the differences between images carry much of the information needed to interpret the changes (Mileva et al., 2019; Vernon et al., 2014). For invariant characteristics, such as identity, however, image variability has in a sense to be discounted to recognize a familiar face. For this reason, variability has often been considered to constitute “noise” that merely hinders recognition, but it now seems more likely that it is informative about the identity-specific variability that the face-recognition mechanism must learn to deal with (Bruce, 1994; Burton, 2013; Burton et al., 2016). This can be seen in demonstrations that learning new faces from single photographs leads to poor generalization of recognition to new images of the same face (Bruce, 1982; Longmore et al., 2008), whereas learning from highly variable images shows the good generalization found in everyday familiar face recognition (Devue et al., 2019; Dowsett et al., 2016; Kramer et al., 2017b).

It is important, too, to keep in mind the fact that face perception is part of a wider system of interpersonal perception and communication. Examining how facial information is integrated with other social signals involving our voices and bodies is instructive. In general, it is the relatively invariant signals that seem usually to be decoded from the face or voice alone, whereas the changeable signals involve much closer integration between different domains (Young, 2018; Young et al., 2020). This contrast is particularly clearly seen in neuropsychological studies where deficits in recognition of familiar people following brain injury can mainly affect face recognition (prosopagnosia) or mainly affect voice recognition (phonagnosia), forming a double dissociation between neuropsychological deficits affecting recognition of an invariant characteristic (identity) across these different domains of face and voice (Schweinberger & Zäske, 2018; Young et al., 2020). The pattern of neuropsychological deficits affecting emotion recognition is very different, and problems recognizing emotional facial expressions almost invariably co-occur with problems in recognizing vocal emotion when this is tested (Young, 2018). Much the same points arise in behavioral research, where long-term priming effects for identity recognition tend to be domain-specific (Young et al., 2020), whereas trait impressions and emotion recognition combine cues from different domains (de Gelder & Van den Stock, 2011; Mileva et al., 2018), and individual differences in emotion recognition tend to implicate a supramodal factor that involves faces, voices, and bodies (Connolly et al., 2020).

Again, this pattern seems to mirror the contrasting demands of everyday life. Relatively invariant characteristics, such as gender or familiar identity, can be accessed from face or voice alone and also create few temporal demands because they don’t change during a social encounter; they can therefore be dealt with by a domain-specific system. In contrast, changeable characteristics are often inherently ambiguous and at the same time put a premium on monitoring signals from moment to moment. A system that can pool complementary sources of information across different domains then represents an optimal solution (Young et al., 2020). How this integration is achieved is a major theoretical issue (Cao et al., 2019; Ernst & Bülthoff, 2004; Teufel et al., 2019).

At an even deeper level, too, face (and voice and body) perception forms part of an integrated system used to make sense of other people’s behavior. There are striking parallels between dimensions of trait impressions from faces and factors that influence expressed partner preferences from questionnaires (South Palomares et al., 2018), between trait impressions and the underlying conceptual organization of person construal (Stolier et al., 2020), and between perceived similarities between basic emotions expressed through face and voice and the meanings of these basic emotions themselves (Brooks & Freeman, 2018; Kuhn et al., 2017). These deep similarities seem to demand explanation (Cowen & Keltner, 2021). Again, they may at least in part reflect a common background to the interpretation of the behavior of conspecifics shared with other primate species.

Further Reading

A fairly detailed textbook:

  • Bruce, V., & Young, A. (2012). Face perception. Psychology Press.

The implications of cognitive neuroscience and neuropsychological approaches:

Important perspectives on recognising face identity:

Cue integration and the effect of context on interpreting communicative signals:

A well-illustrated, scholarly and highly readable book on facial trait impressions:

  • Todorov, A. (2017). Face value: The irresistible influence of first impressions. Princeton University Press.

An overview of individual differences:

Sensitive periods in the development of face perception abilities:

  • Maurer, D., & Mondloch, C. (2011). Sensitive periods in face perception. In A. J. Calder, G. Rhodes, J. V. Haxby, & M. H. Johnson (Eds.), The Oxford handbook of face perception (pp. 779–798). Oxford University Press.

A very useful resource for research: