

Speech Perception and Generalization Across Talkers and Accents

Abstract and Keywords

The seeming ease with which we usually understand each other belies the complexity of the processes that underlie speech perception. One of the biggest computational challenges is that different talkers realize the same speech categories (e.g., /p/) in physically different ways. We review the mixture of processes that enable robust speech understanding across talkers despite this lack of invariance. These processes range from automatic pre-speech adjustments of the distribution of energy over acoustic frequencies (normalization) to implicit statistical learning of talker-specific properties (adaptation, perceptual recalibration) to the generalization of these patterns across groups of talkers (e.g., gender differences).

Keywords: speech perception, talker variability, invariance, normalization, adaptation, generalization, perceptual learning, distributional learning, motor theory, articulatory recovery

1. Overview

The overarching goal of speech perception research is to explain how listeners recognize and comprehend spoken language. One of the biggest challenges of speech perception is the lack of a one-to-one mapping between acoustic information in the speech signal and linguistic categories in memory. This so-called lack of invariance in speech (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967) stems from a host of factors (see Klatt, 1986, for an overview). The physical properties of speech sounds produced by a given talker vary across productions due to factors such as tongue position, jaw position, and the temporal coordination of articulators (Stevens, 1972; see also Marin, Pouplier, & Harrington, 2010), as well as articulatory carefulness in formal versus casual speech (Lindblom, 1990), speaking rate, emotional state (Protopapas & Lieberman, 1997), and the degree of coarticulation with adjacent sounds (see Ladefoged, 1980; Öhman, 1966). Further, when comparing across talkers, variability arises due to differences in anatomical structures such as vocal tract length and vocal fold size (Fitch & Giedd, 1999; Peterson & Barney, 1952), as well as differences due to age (Lee, Potamianos, & Narayanan, 1999), gender (Perry, Ohde, & Ashmead, 2001), idiolectal articulatory gestures (Ladefoged, 1989), and regional or non-native accent (Labov, Ash, & Boberg, 2006). As a result, the physical realization of the same speech category can differ greatly over time, especially when produced by different talkers (Hillenbrand, Getty, Clark, & Wheeler, 1995; Peterson & Barney, 1952; Potter & Steinberg, 1950). For example, an adult female’s production of /ʃ/ might be very similar to an adult male’s production of /s/ due to the influence of vocal tract size on spectral center of gravity (one of the primary cue dimensions that indicates place of articulation for fricatives). Similarly, one talker’s realization of the vowel /ɛ/ (as in said) might sound like another talker’s realization of the vowel /æ/ (as in sad) due to cross-dialectal differences in the realization of these vowels. Figure 1 illustrates this many-to-many mapping problem for a hypothetical category contrast. Note that Figure 1 illustrates the consequences of within- and between-talker variability along a single category-relevant acoustic cue dimension; however, speech is high dimensional, and speech categories are often signaled by multiple cues (Fox, Flege, & Munro, 1995; Jongman, Wayland, & Wong, 2000).


Figure 1. Schematic example illustrating the lack of invariance in speech and the resulting generalization problem. Top panel: Categories are realized as distributions of cue values (C1 and C2) produced by two talkers with different distributions and category boundaries (dots indicate the point of maximal overlap between categories for each talker). For both talkers, lower cue values in the talker’s range map to C1. However, Talker 2’s productions are shifted relative to Talker 1’s, reflecting the fact that talker-related factors such as vocal anatomy and accent affect the overall range of cue values that different talkers produce. Further, Talker 2 has a more peaked distribution for C2, compared to Talker 1, indicating less variability (greater precision) in the acoustic realization of this category. Bottom panel: Categorization function of an ideal observer model (which closely matches human behavior), showing the probability of hearing C2 for each cue value (for discussion of ideal observer models, see Section 4.2). Categorization is at chance where a cue is equally likely to have come from either category (see top panel). Note the relationship between the shaded region in the top panel and the categorization functions in the bottom panel. Cue values that fall within this shaded region are strongly associated with C1 when produced by Talker 2, but the same cue values produced by Talker 1 are more likely to map onto C2. Listeners must be able to generalize their implicit knowledge about the mapping between cues and categories across talkers, while adjusting for the fact that these categories are realized in different ways by different talkers.

The lack of invariance in speech leads to an inference and generalization problem. To achieve perceptual constancy in the face of highly variable speech input, listeners must be able to generalize knowledge about the sound structure of their language across words, phonological contexts, talkers, accents, and speaking styles. Indeed, despite the ubiquity of variability in speech, listeners tend to understand native speakers of their language with surprisingly little difficulty. That is, listeners tend to succeed in mapping acoustic input to the linguistic categories intended by the talker. Even in extreme cases, such as listening to a talker with a heavy foreign accent amid a noisy background, listeners can often overcome initial perceptual difficulties. For example, with relatively brief exposure to foreign-accented speech, listeners show improvements in both processing speed (Clarke & Garrett, 2004) and category identification accuracy (Baese-Berk, Bradlow, & Wright, 2013; Bradlow & Bent, 2008; Reinisch & Holt, 2014; see also Romero-Rivas, Martin, & Costa, 2015). Thus, the speech perception system is able to adjust for the fact that the same physical cue values map onto different categories with different probabilities depending on the talker, and conversely that the same category can map onto different cue values (or even entirely different cue dimensions; Smith & Hayes-Harb, 2011; Toscano & McMurray, 2010; see also Schertz, Cho, Lotto, & Warner, 2015).

Precisely how the systems involved in speech perception cope with variability has been, and continues to be, a central and hotly debated theoretical issue in the field (for extensive discussion, see Cutler, Eisner, McQueen, & Norris, 2010; Johnson, 2005; Pisoni, 1997; Strange, 1989). Some theories assume the existence of invariant aspects of speech (e.g., acoustic invariants or invariant phonetic-articulatory gestures) that uniquely define phonetic categories (e.g., Fant, 1960; Galantucci, Fowler, & Turvey, 2006; Myers, Blumstein, Walsh, & Eliassen, 2009; Stevens & Blumstein, 1981). According to these theories, surface variability in the form of an utterance is uninformative, if not irrelevant, with respect to the mapping of speech input to linguistic categories. Other theories assume that variability within and across talkers is highly informative and plays a fundamental role in speech perception (Elman & McClelland, 1986; Goldinger, 1998; Holt, 2005; Kleinschmidt & Jaeger, 2015b). According to the latter view, listeners are sensitive to the relationship between acoustic variability (within and across talkers) and phonetic categories. For example, the categorization functions in the bottom panel of Figure 1 show the statistically optimal cue-to-category mapping function based on talker-specific knowledge of the distribution of variability associated with each category.
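
To make the ideal observer logic concrete, the sketch below implements the kind of categorization function shown in the bottom panel of Figure 1, assuming (as a simplification) that each talker’s categories are Gaussian distributions over a single cue dimension. All means, standard deviations, and priors are invented for illustration; they are not drawn from any dataset.

```python
import math

def p_c2(cue, mu1, sd1, mu2, sd2, prior_c2=0.5):
    """Ideal observer: P(C2 | cue) by Bayes' rule with Gaussian likelihoods."""
    def gauss(x, mu, sd):
        return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    joint_c1 = gauss(cue, mu1, sd1) * (1 - prior_c2)
    joint_c2 = gauss(cue, mu2, sd2) * prior_c2
    return joint_c2 / (joint_c1 + joint_c2)

# The same cue value, evaluated against two talkers' (invented) distributions.
# Talker 2's categories are shifted upward and C2 is more peaked (smaller SD),
# mirroring the schematic in Figure 1.
print(p_c2(0.5, mu1=0.3, sd1=0.1, mu2=0.6, sd2=0.1))   # Talker 1: ~0.82 -> C2
print(p_c2(0.5, mu1=0.5, sd1=0.1, mu2=0.8, sd2=0.05))  # Talker 2: ~0.00 -> C1
```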

This article provides an overview of research on the role of variability in speech perception. We focus primarily on talker variability and the issue of cross-talker generalization. However, the mechanisms for coping with talker variability also play a role in how listeners cope with within-talker contextual variability (for further discussion of this point, see Nusbaum & Magnuson, 1997). We aim to provide an overview of the critical concepts and debates in this domain of research; to chart significant historical developments; to emphasize underlying assumptions and the evidence that supports or opposes those assumptions; and to highlight overlap among lines of research that are often viewed as orthogonal or opposing. We organize the current discussion around four questions that have guided research on talker variability: (a) To what extent are there invariant aspects of speech? (b) How is the speech signal adjusted during (the early stages of) processing? (c) How do listeners make use of the statistical distribution of variability across talkers, such as systematic variation due to a talker’s accent, sex, or other social group membership? And (d) how is such information learned? We conclude by indicating important directions for future research.

2. To What Extent Are There Invariant Aspects of Speech?

Early approaches to the issue of talker variability in speech perception assumed that acoustic variability in the realization of speech sounds was perceptual “noise” that obscures the abstract symbolic content of the linguistic message. To understand how listeners achieve perceptual constancy when confronted with noisy input, a large body of research focused on identifying invariant aspects of speech that uniquely identify phonemic categories (e.g., Fowler, 1986; Ladefoged & Broadbent, 1957; Peterson, 1952; Shankweiler, Strange, & Verbrugge, 1975). According to this approach, listeners’ ability to generalize the sound structure of their language across talkers—i.e., to recognize physically different speech signals from different talkers as instances of the same speech category—is the result of the perceptual system focusing on or extracting invariant aspects of speech and ignoring variability due to the talker’s idiolect, accent or vocal anatomy/physiology.

2.1 Acoustic Invariance

Within this tradition, some researchers aimed to identify invariant acoustic information in the speech signal: that is, category-specific acoustic cues that are produced the same by all talkers (Cole & Scott, 1974; Fant, 1960). The theory of acoustic invariance is most fully elaborated for stop consonants (e.g., /b/, /d/, /g/, etc.; Kewley-Port, 1983; Lahiri, Gewirth, & Blumstein, 1984; Walley & Carrell, 1983). Several studies argued that the shape of the spectrum (the distribution of energy as a function of frequency) at the release of a stop consonant is an invariant cue to place of articulation (Blumstein & Stevens, 1979; Halle, Hughes, & Radley, 1957; Stevens & Blumstein, 1977, 1978; Zue, 1976). As shown in Figures 2 and 3, the gross shape of the short-term spectrum at the time of burst release is diffuse and either falling or flat for labial consonants (e.g., /b/), diffuse and rising for alveolar consonants (e.g., /d/), and compact for velar consonants (e.g., /g/). Thus, it was proposed that a perceptual mechanism that samples the short-term spectra at the time of burst release can reliably distinguish stop consonants with a place of articulation contrast (see, e.g., Stevens & Blumstein, 1981). Indeed, early automatic phoneme recognition systems that implemented such a mechanism achieved considerable accuracy in classifying stop consonants produced by different talkers (Searle, Jacobson, & Rayment, 1979). For further elaboration of acoustic invariance for stop consonant place of articulation, see research on formant transitions (e.g., Delattre, Liberman, & Cooper, 1955; Story & Bunton, 2010).


Figure 2. Examples of waveforms and short-term spectra (boxes) sampled at the release of three voiced and voiceless stop consonants as indicated. Superimposed on two of the waveforms (for /ba/ and /pu/) is the time window of width 26 msec used for sampling the spectrum. Short-time spectra are determined for the first difference of the sampled waveform (for details, see source).

(Reproduced with permission from Blumstein & Stevens, Journal of the Acoustical Society of America, 1979, 66, 1001–1018. Copyright 1979, Acoustical Society of America.)


Figure 3. Schematization of the diffuse-rising, diffuse-falling, and compact templates designed to capture the gross spectral shapes characteristic of alveolar (e.g., /d/, /t/), labial (/b/, /p/), and velar (/g/, /k/) places of articulation, respectively. The diffuse templates require a spread of spectral peaks across a range of frequencies, with increasing energy at higher frequencies for the diffuse-rising template and a falling or flat spread of energy for the diffuse-falling template. The compact template requires a single spectral peak in the mid-frequency range.

(Reproduced with permission from Blumstein & Stevens, Journal of the Acoustical Society of America, 1979, 66, 1001–1018. Copyright 1979, Acoustical Society of America.)

For a theory of acoustic invariance to provide a sufficient account of talker variability and speech perception, invariant category-distinguishing acoustic cues must be identified for the full set of sounds in a language. To date, however, sufficient cues have not been identified for speech sounds like vowels and fricatives, in part because the physical properties of these sounds are highly dependent on the talker’s vocal anatomy and accent, as we discuss further below. It should be noted that while the search for acoustic invariance has not yielded a viable account of speech perception, this line of research markedly advanced the understanding of the spectral properties of speech, which has had far-reaching benefits: inter alia, improving the quality of synthesized speech (see Klatt, 1987).

2.2 Articulatory/Motor Invariance

Another approach to invariance has focused on articulatory gestures, arguing that the invariant aspects of speech are not part of the acoustic signal but rather part of the production processes that generate the signal (Iskarous, Fowler, & Whalen, 2010; Sussman, Fruchter, Hilbert, & Sirosh, 1998). The central tenet of the motor theory of speech perception is that the objects of speech perception are the “intended phonetic gestures” of a talker (Liberman, 1982; Liberman et al., 1967; Liberman & Mattingly, 1985). This claim is based on several assumptions about the architecture of the speech processing system. First, motor theory assumes that speech production and speech perception are tightly linked and share the same representations. Second, this theory assumes that speech sounds are represented in the brain as “invariant motor commands that call for movements of the articulators through certain linguistically significant configurations” (Liberman & Mattingly, 1985, p. 2, emphasis added): e.g., the category [m] is described as a combination of a labial gesture and a velum-lowering gesture. While the abstract category-specific motor commands are assumed to be invariant, the physical execution of these commands naturally varies across utterances and talkers. Thus, Liberman and Mattingly (1985, p. 3) argue that:

[t]o perceive an utterance is to perceive a specific pattern of intended gestures. We have to say “intended gestures,” because, for a number of reasons (coarticulation being merely the most obvious), the gestures are not directly manifested in the acoustic signal or in the observable articulatory movements. It is thus no simple matter . . . to define specific gestures rigorously or to relate them to their observable consequences.

On this view, speech perception involves reconstructing the production plan. In other words, speech input is perceived as the intended phonetic gestures by internally deriving the gestures involved in producing the speech signal (e.g., analysis by synthesis). As the quote above indicates, however, one of the challenges faced by motor theory is to provide an explicit account of how speech input is translated into ostensibly invariant motor commands.

One answer to this challenge holds that the objects of speech perception are the actual physical gestures produced by a talker, rather than the intended gestures (see the direct realist view of speech perception, which is broadly related to the motor theory, but differs in many of the basic assumptions; Best, 1995; Fowler, 1986, 1991; Gibson, 1966). For physical gestures to be the objects of speech perception, these gestures must be “perceivable” even when listeners have no visual information about the physical production of speech sounds (e.g., when talking on the phone or listening to the radio). Research in the field of automatic speech recognition has demonstrated that articulatory gestures can be recovered from the acoustic signal, without any corresponding visual articulatory information (for a review, see Schroeter & Sondhi, 1994) and that these recovered gestures can indeed guide speech recognition (e.g., Mitra, Nam, Espy-Wilson, Saltzman, & Goldstein, 2012).

Any theory of speech perception based on articulatory recovery must (at minimum) account for anatomical and postural differences between talkers; otherwise, acoustic variation resulting from such factors might be wrongly attributed to differences in the movement of articulators, or vice versa. One proposal concerning talker variability and articulatory recovery (see, e.g., McGowan & Berger, 2009; McGowan & Cushing, 1999) starts from the assumption that speech perception relies on an internal articulatory model, which comprises a talker-independent representation of the human vocal tract, along with knowledge of the acoustic consequences that result from different gestural configurations. When listeners hear speech, talker-specific anatomical features are estimated from the speech signal, and these estimates are used to adjust the internal vocal tract representation (for estimation methods, see, e.g., Hogden, Rubin, McDermott, Katagiri, & Goldstein, 2007). Articulatory gestures can then be recovered by comparing the observed speech input to the output of different configurations of the adjusted internal model. McGowan and Cushing (1999) demonstrated that this approach aids the recovery of category-relevant articulatory movements from male and female talkers who differ, inter alia, in vocal tract length and palate height.

There is a considerable body of evidence showing that the perception of speech sounds is indeed influenced by information about the physical production of those sounds (for extensive discussion, see Galantucci et al., 2006; Vroomen & Baart, 2012), such as visual information about articulatory movements—as in the case of the classic McGurk effect (McGurk & MacDonald, 1976)—or haptic information gathered by touching a talker’s face during articulation (Fowler & Dekle, 1991; Sato, Cavé, Ménard, & Brasseur, 2010). These findings indicate that speech production can provide information that guides speech perception (as claimed by auditory theories of speech processing that include a role for motor knowledge; see Figure 4a). However, these findings are insufficient to support the claim that articulatory gestures form the sole basis of speech perception (see Figure 4b). In fact, there are several reasons to doubt this claim.


Figure 4. Coarse schematic models of speech perception illustrating the fundamental difference between auditory and motor theories of speech perception. (a) Schematic of an auditory theory. Acoustic speech input activates auditory-phonological networks, which in turn activate lexical-conceptual networks. (b) Schematic of a motor theory. Acoustic speech input must make contact with motor speech systems to access lexical-conceptual networks.

(Reprinted from Hickok, Holt & Lotto, Trends in Cognitive Sciences, 2009, 13(8), 330–331, with permission from Elsevier Inc.)

One of the main issues with gesture-based theories is that speech production and speech perception can be disrupted independently, which calls into question the assumption that production is required for perception. For example, expressive aphasia, also known as Broca’s aphasia, is a language disorder that is characterized by severe disruption of speech production processes as a result of brain damage (e.g., a brain lesion or stroke), but often only mild, if any, disruption to perception and comprehension processes (see, e.g., Naeser, Palumbo, Helm-Estabrooks, Stiassny-Eder, & Albert, 1989). The linguistic abilities of expressive aphasics are particularly problematic for the motor theory, given that this theory assumes motor-based representations of speech categories that are shared between production and perception (Hickok, Costanzo, Capasso, & Miceli, 2011; Lotto, Hickok, & Holt, 2009; though see Wilson, 2009, for a counterargument).

Further evidence for a dissociation between perception and production comes from studies that demonstrate human-like speech perception phenomena in animals, despite the fact that the animals being tested lack the anatomical apparatus to produce speech. For example, chinchillas show human-like categorical perception of speech sounds—e.g., abrupt rather than gradual changes in perception of voiced /d/ and voiceless /t/ when tokens are varied along a voice onset time continuum (Kuhl & Miller, 1975). Many animals also show a human-like ability to differentiate phonological categories while ignoring talker-related variability in the realization of those categories: e.g., zebra finches (Ohms, Gill, Van Heijningen, Beckers, & ten Cate, 2010); ferrets (Bizley, Walker, King, & Schnupp, 2013); rats (Eriksson & Villa, 2006); chinchillas (Burdick & Miller, 1975); and cats (Dewson, 1964). It is unlikely that these animals evolved to have a mental representation of the human vocal tract or specialized knowledge of the motor commands used to produce speech sounds. Thus, the findings from these animal studies pose a challenge for theories that place gesture-based knowledge at the center of speech perception (for further discussion, see Kriengwatana, Escudero, & ten Cate, 2015).

3. How Is the Speech Signal Adjusted During the Early Stages of Processing?

A third approach to variability in the realization of speech categories proposes that invariance is achieved via perceptual processes that warp or transform the speech signal. This approach is often referred to as normalization: speech perception is taken to effectively normalize variability by interpreting certain aspects of the speech signal in relation to other aspects: e.g., adjusting the perception of voice onset times based on the talker’s speaking rate (Newman & Sawusch, 1996) or adjusting the perceived distribution of energy at different frequencies based on an estimate of the talker’s vocal tract size (see Johnson, 2005, for a review of talker normalization). In other words, unlike the accounts in the previous section that were—at least in their original conceptions—concerned with absolute acoustic or articulatory invariance, normalization accounts are often concerned with relational invariance (e.g., Sussman, 1989).

Before discussing the merits and limitations of normalization approaches, we begin with a detailed example that illustrates one of the core phenomena addressed by this line of research: variability resulting from the talker’s vocal anatomy/physiology. We focus specifically on vocal tract-related vowel variability, which has played a central role in the normalization literature (for discussion, see reviews by Adank, Smits, & van Hout, 2004; Johnson, 2005). Note, however, that normalization accounts have also been developed for consonants (Holt, 2006; Johnson, 1991; Mann & Repp, 1980) and lexical tone (Fox & Qi, 1990; Huang & Holt, 2009; Moore & Jongman, 1997).

Adult men tend to have longer vocal tracts than adult women due to laryngeal descent (i.e., the lowering of the larynx in the throat) during puberty (Fitch & Giedd, 1999). Longer vocal tracts resonate at lower frequencies than shorter vocal tracts (Chiba & Kajiyama, 1941). Thus, vowel productions from adult men tend to be characterized by formants (acoustic resonances of the vocal tract) at lower frequencies than corresponding vowels produced by adult women (see the top left panel of Figure 5; see also Huber, Ash, & Johnson, 1999). This biological difference has consequences for vowel perception. The first and second formants (F1 and F2) vary systematically by vowel type (Ladefoged, 1980) and are two of the primary cue dimensions used in vowel identification (Fox et al., 1995; Verbrugge, Strange, Shankweiler, & Edman, 1976; Yang & Fox, 2014): for example, the vowel /u/ (as in the word suit) is characterized by a lower F1 and (in many varieties of English) F2 than the vowel /ʊ/ (as in the word soot). As a result of laryngeal descent during puberty, and hence formant lowering, the F1 and F2 of an adult male’s realization of /ʊ/ (the vowel with the relatively higher formants) might match the F1 and F2 of an adult female’s realization of /u/, as shown in the bottom left panel of Figure 5 (see also Hillenbrand et al., 1995; Peterson & Barney, 1952).
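
The link between vocal tract length and formant frequencies can be sketched with the textbook idealization of the vocal tract as a uniform tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/4L. This is a rough approximation (real vowels involve non-uniform tract shapes), and the tract lengths below are merely illustrative.

```python
SPEED_OF_SOUND = 35000  # cm/s, approximate value in warm, moist air

def tube_resonances(tract_length_cm, n=3):
    """Resonances of a uniform tube closed at one end: F_k = (2k - 1) * c / (4L)."""
    return [(2 * k - 1) * SPEED_OF_SOUND / (4 * tract_length_cm)
            for k in range(1, n + 1)]

# Illustrative tract lengths: ~17.5 cm (longer, adult male-typical) vs.
# ~14.5 cm (shorter, adult female-typical).
print(tube_resonances(17.5))  # [500.0, 1500.0, 2500.0] Hz
print(tube_resonances(14.5))  # [~603, ~1810, ~3017] Hz -- uniformly higher
```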


Figure 5. Example of cross-talker vowel variability before (left) and after (right) normalizing for differences in vocal-tract length based on F3. Top panel: the average vowel space for adult male and adult female talkers in the vowel corpus collected by Hillenbrand et al. (1995). Talkers in this corpus are from the northern dialect region of American English. Bottom panel: the degree of overlap between adult male productions of /ʊ/ and adult female productions of /u/. Plots show individual vowel tokens (small dots), category means (large dots), and 95% confidence ellipses. Note that the unnormalized male and female vowel spaces (top left panel) have approximately the same geometric configuration, but the male vowels are characterized by comparatively lower absolute F1 and F2 values, which reflects the fact that longer vocal tracts resonate at lower frequencies than shorter vocal tracts. As a result, the distribution of adult male tokens of /ʊ/ is highly overlapping with the distribution of adult female tokens of /u/ in F1×F2 space (bottom left panel). That is, the same acoustic information maps onto different phonological categories with different probabilities depending on the talker’s sex. Hence neither F1 nor F2 provides reliable information for discriminating these vowel categories across talker sex. Normalizing F1 and F2 based on F3 (which is correlated with vocal tract length) considerably reduces the difference between the average male and female vowel spaces (top right panel), while preserving the overall shape of the space (i.e., the relative position of vowels). As a result of F3-normalization, tokens of /ʊ/ and /u/ are less overlapping (bottom right panel) and, hence, more discriminable despite vocal tract differences across talkers.

The example in Figure 5 highlights several facts. First, there is not a one-to-one mapping between category-relevant acoustic cues and linguistic categories: e.g., one talker’s [u] is another talker’s [ʊ]. Second, anatomical differences across talkers have a systematic influence on the acoustic realization of speech sounds: e.g., there is a relationship between vocal tract length and resonance frequencies. Third, talkers with the same dialect maintain the same structural relationships among speech sounds: e.g., the contrast between the vowels /u/ and /ʊ/ and the relative position of these vowels in acoustic-phonetic space. According to the normalization approach, the speech perception system copes with acoustic variability by capitalizing on relational aspects of speech. In other words, it is not the absolute value of category-relevant speech cues that matters for speech perception, but rather the relationship among various cues.

3.1 Normalization as an Automatic Auditory Process

Normalization has a long history in research on speech perception, and a number of related normalization mechanisms have been proposed (see, e.g., Barreda, 2012; Irino & Patterson, 2002; Joos, 1948; Lloyd, 1890a; Lobanov, 1971; Nearey, 1989; Nordström & Lindblom, 1975; Strange, 1989; Zahorian & Jagharghi, 1993). Some normalization accounts are purely auditory, such as ratio-based accounts in which category-relevant speech cues (e.g., F1 and F2 for vowels) are normalized based on acoustic correlates of the talker’s vocal tract length, such as a talker’s fundamental frequency (F0) or third formant (F3; Bladon, Henton, & Pickering, 1984; Claes, Dologlou, ten Bosch, & van Compernolle, 1998; Halberstam & Raphael, 2004; Miller, 1989; Monahan & Idsardi, 2010; Nordström & Lindblom, 1975; Peterson, 1961; Sussman, 1986; Syrdal & Gopal, 1986). Figure 5 shows an example of F3-normalization in which the F1 and F2 from a set of American English vowels (left panel) are converted to F1/F3 and F2/F3 ratios (right panel), which dramatically reduces vocal tract-related variability across talkers with the same accent (see Monahan & Idsardi, 2010, for further discussion). Other normalization accounts assume an articulatory basis for speech perception. For example, in McGowan and Cushing’s (1999) articulatory recovery model (see above), vocal tract normalization is the first step in extracting category-relevant gestural information from the speech signal. What the majority of these proposals share is the belief that normalization of the speech signal results from automatic pre-categorical auditory processes (Huang & Holt, 2011; Sjerps, McQueen, & Mitterer, 2013; Sussman, 1986; Watkins & Makin, 1996).
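
A minimal sketch of the ratio-based logic behind the F3-normalization in Figure 5: dividing F1 and F2 by F3 removes much of the shared vocal tract-length scaling. The formant values below are invented to mimic the male /ʊ/ vs. female /u/ overlap discussed above; they are not taken from the Hillenbrand et al. (1995) corpus.

```python
def f3_normalize(f1, f2, f3):
    """Express F1 and F2 as ratios of F3, a correlate of vocal tract length."""
    return f1 / f3, f2 / f3

# Invented tokens (Hz): a male /ʊ/ and a female /u/ with nearly identical raw
# F1 and F2, but different F3 (reflecting different vocal tract lengths).
print(f3_normalize(f1=440, f2=1020, f3=2240))  # male /ʊ/:   (~0.20, ~0.46)
print(f3_normalize(f1=430, f2=1000, f3=2760))  # female /u/: (~0.16, ~0.36)
# After normalization the two categories separate despite the raw overlap.
```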

Early work on normalization was motivated in part by how the peripheral auditory system encodes the frequency content of speech. In humans (and other mammals), sound frequency discrimination begins in the cochlea. The basilar membrane, which is part of the cochlea, is tonotopically organized, meaning that different regions respond to different frequencies. Specifically, hair cells that are positioned further along the membrane respond to progressively lower frequencies. Building on these basic aspects of sound perception, Potter and Steinberg (1950, p. 812) proposed that “within limits, a certain spatial pattern of stimulation on the basilar membrane may be identified as a given sound regardless of position along the membrane.” In other words, Potter and Steinberg proposed a ratio-based account of normalization in which the peripheral auditory system perceives the relationship among co-occurring formants, rather than perceiving individual formants. In a similar vein, Sussman (1986) developed a simulation-based model of vowel normalization and representation that involved “combination-sensitive neurons,” which integrate information from multiple formants before mapping the normalized input to abstract representations of vowel categories. This line of research suggests that normalization is a low-level process, both in terms of the point in the processing stream at which the adjustments occur (i.e., pre-categorical adjustments to the speech signal, as opposed to higher-level adjustments that affect the mapping of phonetic percepts to phonological categories) and in terms of the perceptual systems that are responsible for these adjustments (i.e., the peripheral auditory system).

It is worth noting that the frequency-position map of the basilar membrane—the relationship between acoustic frequency and position along the membrane—appears to be logarithmic over most of the cochlea’s range of frequency sensitivity (Greenwood, 1961). Thus, log-transforming frequency—or converting the frequency scale to Bark or mel, which are units of measurement that are (approximately) logarithmically related to frequency—is a type of normalization that aims to capture the biological structure of the inner ear and the psychophysics of sound perception (Adank et al., 2004; Sussman, 1986; Syrdal & Gopal, 1986).
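
For concreteness, two standard frequency-scale conversions of this kind can be written down directly. The sketch below uses O’Shaughnessy’s widely cited mel formula and Traunmüller’s (1990) approximation of the Bark scale.

```python
import math

def hz_to_mel(f):
    """O'Shaughnessy's widely used mel-scale formula."""
    return 2595 * math.log10(1 + f / 700)

def hz_to_bark(f):
    """Traunmüller's (1990) approximation of the Bark critical-band scale."""
    return 26.81 * f / (1960 + f) - 0.53

# Equal steps in Hz correspond to progressively smaller steps in mel/Bark,
# reflecting the approximately logarithmic frequency map of the cochlea.
for f in (250, 500, 1000, 2000, 4000):
    print(f, round(hz_to_mel(f)), round(hz_to_bark(f), 1))
```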

Some of the most compelling behavioral evidence for speech normalization comes from experiments showing context-dependent shifts in the perception of speech sounds (Holt, 2005; Laing, Liu, Lotto, & Holt, 2012; Lindblom & Studdert-Kennedy, 1967; Mann, 1980; Mann & Repp, 1980). These studies provide evidence for normalization as a pre-categorical process, though not necessarily a cochlear or peripheral auditory process (see, e.g., Holt & Lotto, 2002; Sjerps et al., 2013). In a seminal study, Ladefoged and Broadbent (1957) found that perception of a target utterance—a word that was relatively ambiguous between “bit” (with an [ɪ]) and “bet” (with an [ɛ]) due to the frequency of F1—shifted depending on the formant structure of the preceding carrier phrase. Ladefoged and Broadbent (1957) manipulated the F1 and F2 of all vowels in the carrier phrase, either lowering or raising them. The lowered or raised F1 and F2 across vowels in the carrier phrase thus suggested a talker with a relatively longer or shorter vocal tract, respectively. Ladefoged and Broadbent found that this manipulation had a spectrally contrastive effect on the perception of the target vowel: when the carrier phrase had raised vowel formants, the vowel in the target word tended to be heard as [ɪ] (as in “bit”), which has a lower F1 than [ɛ]; when the carrier phrase had lowered formants, the target tended to be heard as the relatively higher vowel [ɛ] (as in “bet”). That is, listeners interpreted the vowel in the target word relative to the talker’s vowel space. Thus, the speech perception system compensated for talker-related variability as revealed in the preceding utterance (see also Ladefoged, 1989).

Building on this seminal finding, Holt and colleagues (Holt, 2005, 2006; Huang & Holt, 2009) demonstrated that the same spectrally contrastive perceptual shift occurs even when the carrier phrase is replaced with a series of non-speech sine-wave tones. In these experiments, a constant speech target was interpreted as relatively higher when preceded by a series of pure tones sampled from a distribution of low-frequency tones, but as relatively lower when preceded by a series of tones drawn from a distribution of high-frequency tones. The fact that both speech and non-speech contexts elicit this shift suggests that pre-categorical normalization results from general, rather than speech-specific, auditory processes that are sensitive to the relational properties of the acoustic input (Laing et al., 2012). The finding of spectrally contrastive perceptual shifts suggests that speech perception is sensitive to the statistical distributions of spectral information in the local context, even if these distributions include nonlinguistic spectral information.

Findings by Holt and colleagues further indicate that normalization does not result solely from peripheral auditory mechanisms. Holt and Lotto (2002), for example, found that the spectrally contrastive effect of context occurred even when the preceding context and the target utterance were presented in different ears. This finding suggests that normalization is due, at least in part, to central auditory processes because information from both ears must have been integrated in order for the context-dependent perceptual shift to emerge.

3.2 Category-Intrinsic vs. Category-Extrinsic Normalization

One dimension along which normalization proposals differ is the type of information that is used to perform the normalization (for discussion, see Ainsworth, 1975; Johnson, 1990). Category-extrinsic procedures involve normalizing a category token based on an external frame of reference, such as information from the preceding utterance or context (Holt, 2005, 2006; Sjerps et al., 2013). In an early but highly influential proposal, Joos (1948, p. 61) argued that vowel information is perceptually evaluated on a talker-specific “coordinate system” that the listener “swiftly constructs” based on information from other vowels from the same talker. By contrast, category-intrinsic procedures rely exclusively on information from a given category token to normalize that token. An example of category-intrinsic normalization is calculating the interval between adjacent formants of a given vowel token in order to isolate the relative pattern of formants independent of the absolute formant frequencies (e.g., F2−F1, F1−F0; Syrdal & Gopal, 1986).
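
A minimal sketch of a category-intrinsic procedure in the spirit of Syrdal and Gopal (1986), who evaluated Bark-scaled differences between adjacent spectral peaks (e.g., F1−F0, F2−F1). The Bark conversion repeats the Traunmüller approximation from the previous sketch, and the vowel tokens are invented for illustration.

```python
def hz_to_bark(f):
    """Traunmüller's Bark approximation (repeated here for self-containment)."""
    return 26.81 * f / (1960 + f) - 0.53

def intrinsic_bark_differences(f0, f1, f2):
    """Vowel-intrinsic normalization: Bark intervals between adjacent spectral
    peaks of a single token; no external frame of reference is required."""
    return (hz_to_bark(f1) - hz_to_bark(f0),   # F1 - F0 (Bark)
            hz_to_bark(f2) - hz_to_bark(f1))   # F2 - F1 (Bark)

# Invented /i/-like tokens from a lower- and a higher-pitched talker.
# In both cases F1 - F0 stays well under the ~3-Bark distance that Syrdal and
# Gopal associated with high vowels, despite very different absolute values.
print(intrinsic_bark_differences(f0=120, f1=270, f2=2290))
print(intrinsic_bark_differences(f0=220, f1=310, f2=2790))
```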

Research on the role of F0 in talker normalization serves to highlight the complementarity of category-intrinsic and category-extrinsic approaches. F0 results from the periodic pulsing of the vocal folds and is correlated with vocal fold size and (indirectly) with vocal tract size. F0 therefore provides a vowel-intrinsic reference point for normalizing variability due to the talker’s vocal anatomy (for various instantiations of this approach, see, e.g., Hirahara & Kato, 1992; Johnson, 2005; Katz & Assmann, 2001; Syrdal & Gopal, 1986). However, one limitation of vowel-intrinsic F0 normalization is that listeners can recognize vowels with a high degree of accuracy even when F0 is not present in the signal (Tartter, 1991), as in the case of whispered speech (i.e., there is no periodic pulsing of the vocal folds during whispered speech because the vocal folds are held tight). Vowel-extrinsic F0 normalization (Miller, 1989) provides a potential solution because the formants in whispered vowels could be normalized based on an aggregate measure of the talker’s fundamental frequency, calculated over previous tokens in which F0 was present.
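
A hedged sketch of the vowel-extrinsic idea: accumulate an estimate of the talker’s typical F0 from voiced tokens, then use that aggregate (rather than token-internal F0) as the normalization baseline, so that even whispered tokens can be normalized. The simple running mean and ratio below are illustrative stand-ins, not Miller’s (1989) actual formulation.

```python
class TalkerF0Tracker:
    """Aggregate a talker's F0 across voiced tokens (vowel-extrinsic baseline)."""

    def __init__(self):
        self.total, self.count = 0.0, 0

    def observe(self, f0):
        """Update the running F0 aggregate from a voiced token."""
        self.total += f0
        self.count += 1

    def normalize(self, f1):
        """Normalize a formant against the aggregate F0 -- usable even for
        tokens in which F0 itself is absent (e.g., whispered vowels)."""
        return f1 / (self.total / self.count)

tracker = TalkerF0Tracker()
for f0 in (118, 125, 121):        # invented voiced tokens from one talker
    tracker.observe(f0)
print(tracker.normalize(f1=450))  # whispered token: no F0, still normalizable
```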

The evidence discussed above suggests that normalization involves sensitivity to both syntagmatic relationships—e.g., the relationship between a given speech sound and aspects of the surrounding context—and paradigmatic relationships—e.g., the relationships among category-internal sources of information. This leaves open the possibility that listeners draw on a wide variety of cues, possibly weighting them according to their informativity. This possibility receives some support from a review of proposed vowel normalization algorithms, which found that vowel-extrinsic procedures performed better than vowel-intrinsic procedures in achieving relational invariance (Adank et al., 2004). Since category-extrinsic information is more available and plentiful than category-intrinsic information (the latter is limited, by definition), extrinsic cues are a priori more likely to yield reliable information about talker-related sources of variation, and hence to provide a stable baseline for normalization.

3.3 Normalization and Learning

While normalization algorithms have been shown to reduce talker-related acoustic variability, particularly due to vocal anatomy (see, e.g., Figure 5), this approach has been met with a number of criticisms. We briefly review some of the most important criticisms (for extensive discussion, see Johnson, 1997, 2005). Then we discuss an aspect of these criticisms that has received comparatively little attention: the relationship between normalization and learning.

One criticism of normalization accounts is that instance-specific details of perceived speech are retained in long-term memory and influence subsequent speech processing (Bradlow, Nygaard, & Pisoni, 1999; Goldinger, 1996; Palmeri, Goldinger, & Pisoni, 1993; Schacter & Church, 1992), which indicates that acoustic variability is not “filtered out” during the early stages of processing. These findings spurred a tremendous body of research into speech perception. According to episodic (Goldinger, 1996, 1998) and exemplar-based approaches (Johnson, 1997; Pierrehumbert, 2002; Pisoni, 1997), detailed representations of speech episodes play a central role in how listeners cope with talker variability.

Two related criticisms of normalization accounts come from cross-linguistic studies. First, the exact difference between adult male and adult female vowel formants varies across languages and cannot be reduced to differences in vocal anatomy (Johnson, 2006; see also Bladon et al., 1984; Johnson, 2005). This suggests that cultural factors such as gender-norms contribute to patterns of variation in speech, above and beyond biologically-determined variation, such as sex-based vocal tract differences after puberty. As further evidence of this point, boys and girls in some cultures show adult-like differences in speech production (e.g., boys producing lower formants) long before laryngeal descent during puberty, and thus before biological factors would explain the difference (Johnson, 2005; Lee et al., 1999; Perry et al., 2001). Second, normalization procedures that are effective in reducing inter-talker variability when applied to data from one language are not necessarily equally effective when applied to corresponding data from another language (see, e.g., Disner, 1980).

These cross-linguistic findings raise questions about how normalization processes come into existence. A priori, there are at least three logically possible scenarios: (a) normalization involves a genetically-determined invariant mapping from genetically-determined cues to categories; (b) normalization involves a variable mapping function from genetically-determined cues to categories, with the mapping function being learned through exposure; (c) normalization is simply the use of an invariant mapping function to relate cues to categories, but both the cues and the nature of the mapping function are learned from exposure (e.g., learning that F0 and F3 are related to the talker’s vocal anatomy/physiology and hence can help normalize source-related variability; and further learning how these cues vary due to cultural factors in the listeners’ target language). The first of these scenarios seems unlikely in light of the cross-linguistic evidence cited above. The other two scenarios involve some degree of learning, which is typically not discussed in the normalization literature and is sometimes taken to be incompatible with accounts of automatic low-level processes. However, there is increasing evidence that even some of the lowest level cellular mechanisms in the human perceptual system appear to learn and adapt (Brenner, Bialek, & de Ruyter van Steveninck, 2000; Fairhall, Lewen, Bialek, & de Ruyter van Steveninck, 2001). Taken together, these criticisms suggest that learning processes (e.g., learning of language-specific or talker-specific variation) play an important role in how the speech perception system copes with variability and how listeners are able to generalize knowledge of the sound structure of their language across talkers and utterances. Further, these criticisms suggest that there might be no clear division between perception and learning. We turn to the issue of learning in the next section.

4. How Do Listeners Make Use of the Statistical Distribution of Variability?

A prominent line of recent research on talker variability and perceptual constancy capitalizes on the fact that variability in speech is the rule, rather than the exception, by adopting a view of human perception that is dynamic, adaptive, and context-sensitive (see Bradlow & Bent, 2008; Clayards, Tanenhaus, Aslin, & Jacobs, 2008; Eisner & McQueen, 2005; Kraljic & Samuel, 2006; Maye, Werker, & Gerken, 2002; Pisoni, 1997; Pisoni & Levi, 2007). Indeed, this is increasingly how cognitive scientists see all of the brain, even low-level perceptual areas (Gutnisky & Dragoi, 2008; Sharpee et al., 2006; Stocker & Simoncelli, 2006). Instead of searching for inherently invariant properties of speech, this approach seeks to understand how the systems involved in speech perception track, learn, and respond to patterns of variation in the environment (for discussion, see Elman & McClelland, 1986; Kleinschmidt & Jaeger, 2015b; Samuel & Kraljic, 2009). This approach is based on the fundamental belief that the distribution of variability associated with speech categories—and the fact that different talkers can have different distributions (see Figure 1)—is highly informative (see also Heald & Nusbaum, 2015). As Pisoni (1997, p. 10) explains, “stimulus variability is, in fact, a lawful and highly informative source of information for the perceptual process; it is not simply a source of noise that masks or degrades the idealized symbolic representation of speech in human long-term memory.”

A similar point was noted by Liberman and Mattingly (1985): “systematic stimulus variation is not an obstacle to be circumvented or overcome in some arbitrary way; it is, rather, a source of information about articulation that provides important guidance to the perceptual process” (pp. 14–15, emphasis added). For Liberman and Mattingly, who were proponents of motor theory, the primary focus was on the types of information provided by phonological context: e.g., in the case of coarticulation, systematic variation in formant transitions between a stop consonant and vowel provides information about consonant place of articulation. The research discussed below extends beyond sources of information provided by phonological context to include any source of systematic variation in speech: e.g., a talker’s age, sex, gender, accent, speaking rate, or idiosyncratic speech patterns.

We begin by discussing evidence that speech perception is guided by listeners’ knowledge of how variability is distributed in the world (e.g., how patterns of pronunciation variation are distributed across talkers and social groups). We then discuss research concerned with the learning mechanisms that enable listeners to achieve this sensitivity to the distributional aspects of speech.

4.1 Talker Perception and Speech Processing

Sociolinguistic research over the last several decades has shown that listeners have rich and structured knowledge about the distribution of variability across groups of talkers (see, e.g., Campbell-Kibler, 2007; Clopper & Pisoni, 2004b, 2007; Labov, 1966; Preston, 1989). Listeners use this social knowledge to help generalize knowledge of the sound structure of their language across talkers (see Foulkes & Hay, 2015, for a recent overview). This line of research has demonstrated that speech perception can be influenced by expectations about the talker’s dialect background (Hay, Nolan, & Drager, 2006; Niedzielski, 1999), age (Drager, 2011; Hay, Warren, & Drager, 2006; Koops, Gentry, & Pantos, 2008; see also Walker & Hay, 2011), socio-economic status (Hay, Warren, & Drager, 2006), and ethnicity (Staum Casasanto, 2008) in cases where these social attributes covary statistically with patterns of pronunciation variation in the target language.

For example, Hay and colleagues found that unprompted expectations about an unfamiliar talker—based on visually cued social attributes of the talker—influenced perception of vowel variation in New Zealand English (Hay, Warren, & Drager, 2006). In New Zealand English, the diphthongs /iә/ and /eә/ (as in the words near and square, respectively) are in the process of merging. This change-in-progress is most advanced among younger speakers and members of lower socio-economic groups, whereas older and more affluent speakers tend to maintain the vowel contrast. In one study, Hay, Warren, and Drager (2006) presented listeners with minimal pairs like beer and bare produced by New Zealand talkers who maintained the vowel distinction. Photographs were used to manipulate the perceived age and socio-economic status of the talkers. Results of a two-alternative forced-choice identification task (e.g., Did the talker say the word beer or bare?) showed that identification accuracy was worse when the talker appeared to be younger or from a lower socioeconomic group than when the talker appeared to be older or more affluent (see also Drager, 2011). That is, when the talker appeared to belong to a social group with merged vowels, the target stimuli tended to be treated as homophonous, creating uncertainty about the intended word and resulting in a higher rate of identification “errors.” Crucially, since the speech stimuli were identical across conditions, the difference in identification accuracy can only stem from listeners’ expectations based on the visually cued attributes of the talker.

Relatedly, Niedzielski (1999) found that simply informing listeners about a talker’s ostensible regional background led to differences in how the same physical vowel stimulus was perceived (see also Hay & Drager, 2010). In this study, listeners from Detroit, Michigan, heard target words containing a raised variant of the diphthong /aw/ (e.g., about pronounced more like “a boat”), a phenomenon known as Canadian raising. The listeners’ task was to identify the best match between the vowel in the stimulus word and one of six synthesized vowel tokens, which ranged from a standard-sounding realization of /aw/ to a raised vowel variant. When told the speaker was from Canada, rather than Detroit, listeners were more likely to match the target vowel to one of the raised variants on the synthesized continuum, reflecting the fact that Detroit residents attribute Canadian raising to the speech of Canadians and are virtually unaware of this feature in their own speech.

Sensitivity to the covariance between social factors and the realization of speech categories does not stop at the level of social group membership. Listeners have also been found to be sensitive to talker-specific patterns of variation (Creel, 2014; Goldinger, 1996; Kraljic, Brennan, & Samuel, 2008; Kraljic & Samuel, 2006, 2007; for relevant discussion, see Creel & Bregman, 2011). Using an exposure-test paradigm, Nygaard, Sommers, and Pisoni (1994) found that listeners were better able to recognize new words produced by familiar talkers than words produced by unfamiliar talkers, as indicated by identification accuracy at test for words in noise (see also Nygaard & Pisoni, 1998). The fact that the benefits of exposure generalize to new words from the familiar talkers indicates that listeners learn and use knowledge of talker-specific pronunciation patterns to guide processing of new tokens from those talkers. Trude and Brown-Schmidt (2012) found that when listening to multiple familiar talkers who produce different variants of the same speech category, providing listeners with talker-indexical cues on each trial (e.g., a picture of the talker or a snippet of speech that did not contain the target speech category) facilitated the use of knowledge of talker-specific pronunciation variation.

Taken together, the findings discussed above indicate that listeners’ knowledge about the distribution of variability within and across talkers and social groups provides a rich backdrop against which to evaluate speech. These findings have been successfully accommodated in episodic and exemplar-based models of speech perception (Goldinger, 1998, 2007; Johnson, 1997; Pisoni & Levi, 2007; Sumner, Kim, King, & McGowan, 2014), as well as certain Bayesian approaches (e.g., the ideal adapter; Kleinschmidt & Jaeger, 2015a, b). One of the central tenets of these approaches is that listeners draw on their experience with category exemplars to learn how linguistic variability (e.g., the realization of the near and square vowels in New Zealand English) covaries with social factors. Listeners then leverage this knowledge to predict the likelihood with which certain speech cues map to higher-level linguistic categories. On this view, listeners are expected to be less certain about the cue-to-category mapping when social factors suggest that the talker is likely to have the near-square merger, resulting in more categorization errors, which is what Hay, Warren, and Drager (2006) found.
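
The final prediction can be unpacked with a toy calculation (all numbers invented for illustration): if socially cued expectations raise the probability that the talker merges the near and square vowels, the expected accuracy of identifying an acoustically identical “beer” token drops toward chance.

```python
def expected_accuracy(p_merged, acc_if_distinct=0.9, acc_if_merged=0.5):
    """Expected identification accuracy for a distinctly produced 'beer' token,
    averaging over the listener's belief that the talker has the NEAR-SQUARE
    merger. If the talker is believed merged, the token is treated as
    (near-)homophonous, so accuracy falls toward chance (0.5)."""
    return p_merged * acc_if_merged + (1 - p_merged) * acc_if_distinct

print(expected_accuracy(p_merged=0.2))  # looks older/affluent:    0.82
print(expected_accuracy(p_merged=0.8))  # looks younger/lower SES: 0.58
```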

4.2 Adaptation and Perceptual/Distributional Learning

In order for talker-specific or group-based information to be useful in speech perception, listeners must first learn the patterns of variation that are associated with particular talkers or groups of talkers. A large body of research—much of it in recent years—has investigated the mechanisms that track and respond to patterns of variation in speech input (see Aslin & Newport, 2012; Samuel & Kraljic, 2009). The conceptual foundations of this research can be traced in part to seminal work on perceptual learning by James and Eleanor Gibson in the 1950s and ’60s (Gibson, 1969; Gibson & Gibson, 1955). Gibson (1969, p. 3) defined perceptual learning as “an increase in the ability to extract information from the environment, as a result of experience and practice with stimulation coming from it.” The premise of this view is that perception is fundamentally shaped by the perceiver’s existing knowledge and past experiences in such a way as to facilitate processing of the input, rather than being an objective translation of the physical world into units of perception. The appeal of this view for theories of speech perception is that speech categories (e.g., /b/ vs. /d/, /u/ vs. /ʊ/) need not be distinguished by a fixed set of acoustic, articulatory, or relational invariants. Rather, through experience with specific talkers or groups of talkers, listeners can learn the cue dimensions and distributions of cue values that are relevant for distinguishing speech categories produced by those talkers (Clayards et al., 2008; Idemaru & Holt, 2011; Kleinschmidt & Jaeger, 2015b; Liu & Holt, 2015; Maye et al., 2002; Theodore & Miller, 2010). In other words, the speech perception system adapts. For the present discussion, we use the term adaptation to refer to the outcome of a learning mechanism (see Goldstone [1998] for an ontology of perceptual learning mechanisms).
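
A minimal sketch of such distributional learning under simplifying assumptions (one cue dimension, Gaussian categories, equal priors): fit each category’s distribution from a talker’s labeled exposure tokens, then place the boundary where the two fitted likelihoods cross. The voice onset time (VOT) tokens are invented for illustration.

```python
import math
import statistics

def gauss(x, mu, sd):
    """Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def learn_category(tokens):
    """Fit a Gaussian to one category's exposure tokens from a talker."""
    return statistics.mean(tokens), statistics.stdev(tokens)

def boundary(mu1, sd1, mu2, sd2, steps=10000):
    """Grid-search the cue value between the two category means where the
    fitted likelihoods cross (equal priors): the learned category boundary."""
    xs = [mu1 + (mu2 - mu1) * i / steps for i in range(steps + 1)]
    return min(xs, key=lambda x: abs(gauss(x, mu1, sd1) - gauss(x, mu2, sd2)))

# Invented VOT exposure tokens, in ms, from one talker:
mu_b, sd_b = learn_category([2, 5, 8, 11, 4])      # /b/-like tokens
mu_p, sd_p = learn_category([48, 55, 62, 51, 58])  # /p/-like tokens
print(boundary(mu_b, sd_b, mu_p, sd_p))  # ~25 ms, adapted to this talker
```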

One of the classic demonstrations of perceptual learning for speech is that listeners dynamically recalibrate phonetic category boundaries in response to variation in the speech input (Bertelson, Vroomen, & de Gelder, 2003; Norris, McQueen, & Cutler, 2003). For example, when listeners encounter a talker whose realization of /s/ is acoustically ambiguous between [s] and [f], listeners adjust their category boundary to perceive the otherwise ambiguous stimulus as /s/ (i.e., as an instance of the category intended by the talker). This phonetic recalibration effect can be driven by lexical knowledge, such as hearing the ambiguous sound in a disambiguating lexical context: e.g., hearing “platypu[s/f]” for platypus, an /s/-final word with no /f/-final counterpart (Kraljic & Samuel, 2005; McQueen, Cutler, & Norris, 2006; Norris et al., 2003). Phonetic recalibration can also be driven by visual information: e.g., hearing a sound that is acoustically ambiguous between [b] and [d], but seeing the talker produce the labial closure for [b] (Bertelson et al., 2003; see Vroomen & Baart, 2012, for a recent review) and by statistical knowledge about contingencies among acoustic-phonetic cues (Idemaru & Holt, 2011).

Perceptual learning for speech helps listeners cope with talker variability by tailoring speech perception processes to patterns of variation in the input. As evidence of pattern abstraction, adaptation to atypical segmental variation ([s/f] for /s/) generalizes to new words that are pronounced with the segmental variant (Maye et al., 2008; McQueen et al., 2006; Mitterer, Chen, & Zhou, 2011; Mitterer, Scharenborg, & McQueen, 2013; Sjerps & McQueen, 2010; Weatherholtz, 2015; see also Greenspan, Nusbaum, & Pisoni, 1988). This finding of generalization indicates that listeners abstracted over the trained word forms to learn a sublexical pattern of variation, as opposed to simply encoding the atypical word forms experienced during training (e.g., “platypu[s/f]” for platypus; McQueen et al., 2006).

A central question in this line of research concerns the conditions under which pattern abstraction is talker-specific versus talker-independent (see Bradlow & Bent, 2008; Kraljic & Samuel, 2007; Reinisch & Holt, 2014). Both can be beneficial. For example, when a property of speech is idiosyncratic to a talker, an ideal adapter should learn talker-specific expectations, adapting expectations for only that talker. However, when patterns of variation occur across talkers (e.g., dialect or accent variation), an ideal adapter should learn talker-independent but group-specific expectations in order to generalize learning to new talkers with the same dialect or accent (Kleinschmidt & Jaeger, 2015b). There is some evidence that human listeners behave in ways that are qualitatively and quantitatively similar to ideal adapters (Bejjanki, Clayards, Knill, & Aslin, 2011; Clayards et al., 2008; Kleinschmidt & Jaeger, 2011, 2012, 2015a; Kleinschmidt, Raizada, & Jaeger, 2015). For example, exposure to multiple talkers with the same accent or dialect facilitates cross-talker generalization by helping listeners distinguish talker-independent patterns of variation from inter-talker variability in the realization of those patterns. This effect of exposure conditions on learning outcomes has been observed for a range of perceptual learning phenomena: adapting to foreign-accented speech (Bradlow & Bent, 2008; Gass & Varonis, 1984; Sidaras, Alexander, & Nygaard, 2009); learning new perceptual categories, such as Japanese learners of English acquiring the /l/–/r/ contrast (Lively, Logan, & Pisoni, 1993; Logan, Lively, & Pisoni, 1991); and learning to classify talkers by regional dialect (Clopper & Pisoni, 2004a).

Some results suggest that multi-talker exposure is a necessary precondition for talker-independent adaptation (Bradlow & Bent, 2008; Lively et al., 1993). For example, Bradlow and Bent (2008) found that listeners who were familiarized with five Mandarin-accented English talkers were subsequently able to generalize learning to a novel talker with this accent, indicating talker-independent adaptation. However, when listeners were initially familiarized with a single Mandarin-accented English talker, adaptation was talker-specific: i.e., listeners were subsequently better able to understand the trained talker’s speech in noise, but accent adaptation did not generalize across talkers. By contrast, several studies concerned with phonetic recalibration have found cross-talker generalization of perceptual learning following exposure to a single talker (Eisner & McQueen, 2005; Kraljic & Samuel, 2006, 2007; see also Weatherholtz, 2015). The likelihood of cross-talker generalization following exposure to a single talker seems to depend largely on the acoustic similarity between the familiar and new talkers (Eisner & McQueen, 2005; Kraljic & Samuel, 2007; Reinisch & Holt, 2014). Thus, when listeners do not have evidence that a particular pattern of variation is systematic across talkers (as in the case of single-talker exposure), listeners appear to adapt talker-specifically and only generalize learning to acoustically similar tokens produced by other talkers (i.e., generalization based on stimulus similarity). But when listeners have evidence of variation that is systematic across talkers (as in the case of multi-talker exposure), listeners adapt by learning talker-independent patterns of variation (see Kleinschmidt & Jaeger, 2015b; Weatherholtz, 2015, for additional discussion).
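
The contrast between single-talker and multi-talker exposure can be illustrated with a toy simulation. The sketch below is our own illustration, not a model from Bradlow and Bent (2008) or the other studies cited above; the size of the accent shift, the amount of talker idiosyncrasy, and the number of exposure talkers are all hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(0)
accent_shift = 3.0     # shared, talker-independent component (arbitrary units)
idiosyncrasy_sd = 1.5  # between-talker variability around that shift

def observed_shifts(n_talkers):
    """Each talker's measured cue shift = shared accent + idiosyncrasy."""
    return accent_shift + rng.normal(0.0, idiosyncrasy_sd, size=n_talkers)

# Single-talker exposure: the only available estimate of the accent still
# contains that talker's idiosyncrasy, so it may generalize poorly.
print("one talker:  ", observed_shifts(1).mean())

# Multi-talker exposure: averaging cancels idiosyncrasies, and the
# uncertainty of the shared-shift estimate shrinks as 1/sqrt(n_talkers),
# which is what supports generalization to a novel talker with this accent.
multi = observed_shifts(5)
print("five talkers:", multi.mean(), "+/-", idiosyncrasy_sd / np.sqrt(len(multi)))
```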

One of the current goals of research in this domain is to provide a formal account of plasticity in speech perception that captures both adaptation and generalization. One approach, the ideal adapter, is a computational-level framework for understanding the processes involved in reliably mapping acoustic cues to speech categories across talkers (Kleinschmidt & Jaeger, 2015b). According to the ideal adapter framework, speech perception is a process of inference under uncertainty: listeners infer the category that the talker intended to produce based on the observed acoustic cue values and relevant prior knowledge. Critically, prior knowledge is assumed to also comprise distributional statistics that capture how variability in the acoustic realization of speech sounds covaries with indexical information (see note 4): e.g., how the distributions of acoustic cue values associated with /s/ vs. /ʃ/, or with the vowels in near vs. square, vary across talkers and social groups. By drawing on implicit knowledge of talker-specific and group-specific distributional statistics, listeners are able to infer whether an observed constellation of acoustic cue values maps to /s/ or /ʃ/, for example, based on indexical information about the talker (see Figure 1 for a schematic example of this logic). When listeners lack relevant (or robust) talker-specific knowledge, as when encountering a novel talker, they can generalize based on prior experience with similar talkers (e.g., talkers with the same or similar accent). The ideal adapter framework further predicts that listeners continuously update their beliefs about talker-specific and group-specific distributions as they experience new input from familiar and new talkers. Thus, category inferences are predicted to change as the short-term and long-term statistics of the environment change (Kleinschmidt & Jaeger, 2015b).
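
The framework’s core computation can be compressed into a short sketch. The code below is a simplified stand-in for the full model in Kleinschmidt and Jaeger (2015b): it implements group-conditioned Bayesian category inference and a rudimentary belief update, with hypothetical cue distributions for the /s/−/ʃ/ contrast (written "s" and "sh" in the code).

```python
import numpy as np

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution; used as the category likelihood."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Hypothetical spectral center-of-gravity means and SDs (kHz) for /s/ vs.
# /sh/, conditioned on talker group (shorter vocal tracts yield higher
# frequencies; the numbers are illustrative, not measured values).
cue_models = {
    "female": {"s": (8.0, 0.7), "sh": (5.5, 0.7)},
    "male":   {"s": (6.5, 0.7), "sh": (4.5, 0.7)},
}

def posterior(cue, group):
    """P(category | cue, group), assuming equal category priors."""
    likes = {cat: gaussian_pdf(cue, m, sd)
             for cat, (m, sd) in cue_models[group].items()}
    total = sum(likes.values())
    return {cat: like / total for cat, like in likes.items()}

# The same acoustic token is inferred differently depending on the talker:
print(posterior(6.3, "female"))  # mostly /sh/ if the talker is female
print(posterior(6.3, "male"))    # mostly /s/ if the talker is male

def update_mean(prior_mean, prior_n, observations):
    """Weighted-mean belief update of a category's expected cue value."""
    n = len(observations)
    return (prior_n * prior_mean + n * np.mean(observations)) / (prior_n + n)

# Belief updating: /s/ tokens from a particular female talker pull the
# listener's expected /s/ mean toward that talker's productions, changing
# subsequent category inferences for her speech.
mean_s, _ = cue_models["female"]["s"]
print(update_mean(mean_s, prior_n=20, observations=[7.0, 7.2, 6.9]))
```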

A closely related view, known as computing cues relative to expectations (C-CuRE), focuses on the cues themselves, rather than the cue-to-category mapping process, and suggests that the perceptual encoding of acoustic cues is a dynamic and talker-contingent process (Cole, Linebaugh, Munson, & McMurray, 2010; McMurray & Jongman, 2011; see also Kleinschmidt & Jaeger, 2015a). As listeners identify other category-relevant sources of variability (e.g., identifying the talker or the talker’s social group membership), the acoustic cue values are recoded in terms of their difference from expected values. C-CuRE can be seen as a specific algorithm (potentially one of several) by which the computational-level ideal adapter framework is implemented. Consider the minimal pair beach-peach. Fundamental frequency (F0) at vowel onset is a secondary cue to the voiced-voiceless (e.g., /b/−/p/) contrast in English, with relatively higher F0 for voiceless segments. Since F0 varies systematically as a function of vocal anatomy (primarily vocal fold size), an F0 value that is high for a male talker might be low for a female talker. Thus, knowing the raw F0 value at vowel onset is not particularly informative about whether the preceding segment is voiced or voiceless. If listeners have independently identified the talker’s gender (or the identity of the talker), raw F0 can be recoded in terms of its difference from the expected gender-specific (or talker-specific) mean F0. C-CuRE is thus also related to the normalization accounts discussed above: recoding acoustic values in terms of their difference from expected values emphasizes the spectrally contrastive nature of speech perception (see, e.g., Holt, 2005; Holt & Lotto, 2002; Huang & Holt, 2009), and computing the difference relative to expected talker means partials out (normalizes) talker variability.
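
The beach-peach example translates directly into a few lines of code. The sketch below is our illustration of the C-CuRE idea, not the implementation in McMurray and Jongman (2011); the expected F0 means and the mapping from relative F0 to voicing evidence are hypothetical.

```python
import numpy as np

# Hypothetical expected mean F0 (Hz) at vowel onset by talker group.
expected_f0 = {"male": 120.0, "female": 210.0}

def recode(f0, group):
    """C-CuRE-style recoding: raw F0 minus the group-specific expectation."""
    return f0 - expected_f0[group]

def p_voiceless(relative_f0, scale=10.0):
    """Toy mapping: higher-than-expected onset F0 favors voiceless /p/."""
    return 1.0 / (1.0 + np.exp(-relative_f0 / scale))

# The same raw F0 (150 Hz) is high for a male talker (evidence for "peach")
# but low for a female talker (evidence for "beach"):
for group in ("male", "female"):
    rel = recode(150.0, group)
    print(group, rel, round(p_voiceless(rel), 3))
```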

5. Open Questions and Directions for Future Research

Research in speech perception has come a long way in understanding how variable speech input is mapped to linguistic categories in memory: from simply assuming invariance to modeling a highly complex, layered system that draws on statistical information in the speech signal. We have discussed several aspects of the speech perception system that enable listeners to cope with talker variability: sensitivity to articulatory gestures (and recovery of articulatory information from the speech signal), normalization, memory for episodic detail, and perceptual and distributional learning mechanisms that are sensitive to patterns of variation in speech. At least some of these mechanisms are automatic and appear to operate during the early stages of processing (i.e., pre-categorical, pre-speech mechanisms). It is an open question to what extent these early processes involve learning: for example, learning early in development that F0 and F3 are correlated with vocal tract length and, hence, learning that these cue dimensions can be used to normalize variability resulting from individual differences in vocal anatomy. Similarly, it is an open question how flexible these low-level processes are. While there is evidence that low-level auditory processes engage in distributional learning (for discussion, see Kleinschmidt & Jaeger, 2015a), it is not yet known whether the types of distributional learning involved in adapting to talker and accent variability are neurally coded early (for discussion, see Goslin, Duffy, & Floccia, 2012), nor whether sensitivity to covariation between social factors and speech cues is partly due to low-level processes.

The debate between normalization accounts and episodic/exemplar-based accounts of speech perception warrants further discussion. The fact that episodic details of speech stimuli are retained in memory and affect speech perception is one of the primary challenges to normalization accounts (Johnson, 2005). However, it is important to note that abstracting away from variance is, in principle, orthogonal to whether fine acoustic details are retained in memory. That is, normalization accounts that aim to identify relational invariants (e.g., vowel formant ratios that are stable across talkers) do not require fine-grained stimulus details to be “filtered out,” forgotten, or otherwise inconsequential for speech perception. Thus, the fact that speech perception is sensitive to episodic detail indicates that normalization accounts, as typically formulated, are insufficient, but it does not rule out normalization altogether. Likewise, episodic and exemplar-based theories (e.g., Goldinger, 1996; Johnson, 1997) do not provide a straightforward account for some of the strongest evidence of normalization: that speech sounds are interpreted relative to frequency information in the surrounding context, even when this “context” consists of non-speech sine-wave tones (Laing et al., 2012). Thus, like normalization accounts, episodic models alone are insufficient to explain how the speech perception system copes with variability, despite evidence that episodic information plays an important role in recognition and categorization processes (Bradlow et al., 1999; Palmeri et al., 1993).

We take the integration of these sometimes conflicting (though not necessarily incompatible) views to be one of the big open questions in research on speech perception. Similar open questions pertain to the relation between abstractionist prototype accounts of speech perception (such as parametric Bayesian accounts) and non-prototype accounts (such as exemplar-based accounts). It is now increasingly assumed that the representations that subserve speech perception likely involve both storage of specific exemplars and abstractions over these exemplars (Goldinger, 2007; Kleinschmidt & Jaeger, 2015b; McQueen et al., 2006; Pierrehumbert, 2001). Integrating these views will be critical to understanding how low-level pre-speech and higher-level speech processes jointly achieve relative invariance: the ability to robustly recognize speech categories across talkers.

Acknowledgments

This work was partially funded by NICHD grant R01 HD075797 to T. Florian Jaeger and by a Graduate Enrichment Fellowship from The Ohio State University to Kodi Weatherholtz. We are grateful to Arthur Samuel, Sheila Blumstein, Beth Hume, Jessamyn Schertz, and two anonymous reviewers for comments on an early version of this article. All errors or oversights are our own. The views expressed here are those of the authors and not necessarily those of the funding agencies.

Further Reading

Baese-Berk, M. M., Bradlow, A. R., & Wright, B. A. (2013). Accent-independent adaptation to foreign accented speech. Journal of the Acoustical Society of America, 133(3), EL174–EL180. doi:10.1121/1.4789864

Barreda, S., & Nearey, T. M. (2013). Training listeners to report the acoustic correlate of formant-frequency scaling using synthetic voices. Journal of the Acoustical Society of America, 133(2), 1065–1077. doi:10.1121/1.4773858

Clarke, C., & Garrett, M. F. (2004). Rapid adaptation to foreign-accented English. Journal of the Acoustical Society of America, 116, 3647–3658. doi:10.1121/1.1815131

Dahan, D., Drucker, S. J., & Scarborough, R. A. (2008). Talker adaptation in speech perception: Adjusting the signal or the representations. Cognition, 108(3), 710–718. doi:10.1016/j.cognition.2008.06.003

Dahan, D., & Mead, R. (2010). Context-conditioned generalization in adaptation to distorted speech. Journal of Experimental Psychology: Human Perception and Performance, 36(3), 704–728. doi:10.1037/a0017449

Dupoux, E., & Green, K. (1997). Perceptual adjustment to highly compressed speech: Effects of talker and rate changes. Journal of Experimental Psychology: Human Perception and Performance, 23, 914–927. doi:10.1037/0096-1523.23.3.914

Eisner, F., Melinger, A., & Weber, A. (2013). Constraints on the transfer of perceptual learning in accented speech. Frontiers in Psychology, 4, 1–9. doi:10.3389/fpsyg.2013.00148

Foulkes, P., & Docherty, G. (2006). The social life of phonetics and phonology. Journal of Phonetics, 34, 409–438. doi:10.1016/j.wocn.2005.08.002

Jones, B. C., Feinberg, D. R., Bestelmeyer, P. E. G., DeBruine, L. M., & Little, A. C. (2010). Adaptation to different mouth shapes influences visual perception of ambiguous lip speech. Psychonomic Bulletin & Review, 17(4), 522–528. doi:10.3758/PBR.17.4.522

Jongman, A., Wade, T., & Sereno, J. (2003). On improving the perception of foreign-accented speech. In M.-J. Solé, D. Recasens, & J. Romero (Eds.), Proceedings of the 15th International Congress of Phonetic Sciences (pp. 1561–1564). Barcelona: Universitat Autònoma de Barcelona.

Kouider, S., & Dupoux, E. (2005). Subliminal speech priming. Psychological Science, 16(6), 617–625. doi:10.1111/j.1467-9280.2005.01584.x

Kuhl, P. K. (1979). Speech perception in early infancy: Perceptual constancy for spectrally dissimilar vowel categories. Journal of the Acoustical Society of America, 66(6), 1668–1679. doi:10.1121/1.383639

Kuhl, P. K. (1983). Perception of auditory equivalence classes for speech in early infancy. Infant Behavior and Development, 6(2–3), 263–285. doi:10.1016/S0163-6383(83)80036-8

Mirman, D., McClelland, J. L., & Holt, L. L. (2006). An interactive Hebbian account of lexically guided tuning of speech perception. Psychonomic Bulletin & Review, 13(6), 958–965. doi:10.3758/BF03213909

Mulak, K. E., & Best, C. T. (2013). Development of word recognition across speakers and accents. In L. Gogate & G. Hollich (Eds.), Theoretical and computational models of word learning: Trends in psychology and artificial intelligence (pp. 242–269). Hershey, PA: Information Science Reference. doi:10.4018/978-1-4666-2973-8.ch011

Myers, E. B., Blumstein, S. E., Walsh, E., & Eliassen, J. (2009). Inferior frontal regions underlie the perception of phonetic category invariance. Psychological Science, 20(7), 895–903. doi:10.1111/j.1467-9280.2009.02380.x

Nearey, T. M., & Assmann, P. F. (2007). Probabilistic “sliding-template” models for indirect vowel normalization. In M.-J. Solé, P. S. Beddor, & M. Ohala (Eds.), Experimental approaches to phonology (pp. 246–269). Oxford: Oxford University Press.

Sussman, H. M., McCaffrey, H. A., & Matthews, S. A. (1991). An investigation of locus equations as a source of relational invariance for stop place categorization. Journal of the Acoustical Society of America, 90(3), 1309–1325. doi:10.1121/1.401923

van der Zande, P., Jesse, A., & Cutler, A. (2014). Cross-speaker generalisation in two phoneme-level perceptual adaptation processes. Journal of Phonetics, 43, 38–46. doi:10.1016/j.wocn.2014.01.003

Witteman, M. J., Weber, A., & McQueen, J. M. (2014). Tolerance for inconsistency in foreign-accented speech. Psychonomic Bulletin & Review, 21, 512–519. doi:10.3758/s13423-013-0519-8

References

Adank, P., Smits, R., & van Hout, R. (2004). A comparison of vowel normalization procedures for language variation research. Journal of the Acoustical Society of America, 116(5), 3099–3107. doi:10.1121/1.1795335Find this resource:

Ainsworth, W. A. (1975). Intrinsic and extrinsic factors in vowel judgments. In G. Fant & M. A. A. Tatham (Eds.), Auditory analysis and perception of speech (pp. 103–111). London: Academic Press.Find this resource:

Aslin, R. N., & Newport, E. L. (2012). Statistical learning: From acquiring specific items to forming general rules. Current Directions in Psychological Science, 21, 170–176. doi:10.1177/0963721412436806Find this resource:

Baese-Berk, M. M., Bradlow, A. R., & Wright, B. A. (2013). Accent-independent adaptation to foreign accented speech. Journal of the Acoustical Society of America, 133(3), 174–180. doi:10.1121/1.478986Find this resource:

Barreda, S. (2012). Vowel normalization and the perception of speaker changes: An exploration of the contextual tuning hypothesis. Journal of the Acoustical Society of America, 132(5), 3453–3464. doi:10.1121/1.4747011Find this resource:

Bejjanki, V. R., Clayards, M., Knill, D. C., & Aslin, R. N. (2011). Cue integration in categorical tasks: Insights from audio-visual speech perception. PLOS One, 6(5), e19812. doi:10.1371/journal.pone.0019812Find this resource:

Bertelson, P., Vroomen, J., & de Gelder, B. (2003). Visual recalibration of auditory speech identification: A McGurk after effect. Psychological Science, 14, 592–597. doi:10.1046/j.0956-7976.2003.psci_1470.xFind this resource:

Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience: Theoretical and methodological issues (pp. 171–204). Timonium, MD: New York Press.Find this resource:

Bizley, J. K., Walker, K. M. M., King, A. J., & Schnupp, J. W. H. (2013). Spectral timbre perception in ferrets: Discrimination of artificial vowels under different listening conditions. Journal of the Acoustical Society of America, 133(1), 365–376. doi:10.1121/1.4768798Find this resource:

Bladon, R. A. W., Henton, C. G., & Pickering, J. B. (1984). Towards an auditory theory of speaker normalization. Language Communication, 4(1), 59–69. doi:10.1016/0271-5309(84)90019-3Find this resource:

Blumstein, S. E., & Stevens, K. N. (1979). Acoustic invariance in speech production: Evidence from measurements of the spectral characteristics of stop consonants. Journal of the Acoustical Society of America, 66, 1001–1018. doi:10.1121/1.383319Find this resource:

Bradlow, A. R., & Bent, T. (2008). Perceptual adaptation to non-native speech. Cognition, 106(2), 707–729. doi:10.1016/j.cognition.2007.04.005Find this resource:

Bradlow, A. R., Nygaard, L. C., & Pisoni, D. B. (1999). Effects of talker, rate, and amplitude variation on recognition memory for spoken words. Perception & Psychophysics, 61(2), 206219. doi:10.3758/BF03206883Find this resource:

Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. R. (2000). Adaptive rescaling maximizes information transmission. Neuron, 26(3), 695–702. doi:10.1016/S0896-6273(00)81205-2Find this resource:

Burdick, C. K., & Miller, J. D. (1975). Speech perception by the chinchilla: Discrimination of sustained /a/ and /i/. Journal of the Acoustical Society of America, 58(2), 415–427. doi:10.1121/1.380686Find this resource:

Campbell-Kibler, K. (2007). Accent, (ING), and the social logic of listener perception. American Speech, 82(1), 32–64 doi:10.1215/00031283-2007-002Find this resource:

Chiba, T., & Kajiyama, M. (1941). The vowel: Its nature and structure. Tokyo: Tokyo Publishing Company.Find this resource:

Claes, T., Dologlous, I., Bosch, L. T., & van Compernolle, D. (1998). A novel feature transformation for vocal tract length normalization in automatic speech recognition. IEEE Transactions on Speech and Audio Processing, 6(6), 549–557. doi:10.1109/89.725321Find this resource:

Clarke, C., & Garrett, M. F. (2004). Rapid adaptation to foreign-accented English. Journal of the Acoustical Society of America, 116, 3647–3658. doi:10.1121/1.1815131Find this resource:

Clayards, M., Tanenhaus, M. K., Aslin, R. N., & Jacobs, R. A. (2008). Perception of speech reflects optimal use of probabilistic speech cues. Cognition, 108(3), 804–809. doi:10.1016/j.cognition.2008.04.004Find this resource:

Clopper, C. G., & Pisoni, D. B. (2004a). Effects of talker variability on perceptual learning of dialects. Language and Speech, 47(Pt 3), 207–239. doi:10.1177/00238309040470030101Find this resource:

Clopper, C. G., & Pisoni, D. B. (2004b). Homebodies and army brats: Some effects of early linguistic experience and residential history on dialect categorization. Language Variation and Change, 16, 31–48. doi:10.1017/S0954394504161036Find this resource:

Clopper, C. G., & Pisoni, D. B. (2007). Free classification of regional dialects of American English. Journal of Phonetics, 35, 421–438. doi:10.1016/j.wocn.2006.06.001Find this resource:

Cole, J., Linebaugh, G., Munson, C. M., & McMurray, B. (2010). Unmasking the acoustic effects of vowel-to-vowel coarticulation: A statistical modeling approach. Journal of Phonetics, 38, 167–184. doi:10.1016/j.wocn.2009.08.004Find this resource:

Cole, R. A., & Scott, B. (1974). Toward a theory of speech perception. Psychological Review, 81(4), 348–374. doi:10.1037/h0036656Find this resource:

Creel, S. C. (2014). Preschoolers’ flexible use of talker information during word learning. Journal of Memory and Language, 73, 81–98. doi:10.1016/j.jml.2014.03.001Find this resource:

Creel, S. C., & Bregman, M. R. (2011). How talker identity relates to language processing. Language and Linguistics Compass, 5(5), 190–204. doi:10.1111/j.1749-818X.2011.00276.xFind this resource:

Cutler, A., Eisner, F., McQueen, J. M., & Norris, D. (2010). How abstract phonemic categories are necessary for coping with speaker-related variation. In C. Fougeron, B. Kühnert, M. D’Imperio, & N. Vallée (Eds.), Laboratory phonology 10 (pp. 91–111). Berlin: de Gruyter.Find this resource:

Delattre, P. C., Liberman, A. M., & Cooper, F. S. (1955). Acoustic loci and transitional cues for consonants. Journal of the Acoustical Society of America, 27(4), 769–773. doi:10.1121/1.1908024Find this resource:

Dewson, J. H. (1964). Speech sound discrimination by cats. Science, 144(3618), 555–556. doi:10.1126/science.144.3618.555Find this resource:

Disner, S. F. (1980). Evaluation of vowel normalization procedures. Journal of the Acoustical Society of America, 67(1), 253–261. doi:10.1121/1.383734Find this resource:

Drager, K. (2011). Speaker age and vowel perception. Language and Speech, 54, 99–121. doi:10.1177/0023830910388017Find this resource:

Eisner, F. (2012). Perceptual learning in speech. In N. Seel (Ed.), Encyclopedia of the science of learning (pp. 2583–2584). Berlin: Springer.Find this resource:

Eisner, F., & McQueen, J. M. (2005). The specificity of perceptual learning in speech processing. Perception & Psychophysics, 67(2), 224–238. doi:10.3758/BF03206487Find this resource:

Elman, J. L., & McClelland, J. L. (1986). Exploiting lawful variability in the speech wave. In J. S. Perkell & D. H. Klatt (Eds.), Invariance and variability in speech processes (pp. 360–380). Hillsdale, NJ: Lawrence Erlbaum.Find this resource:

Eriksson, J. L., & Villa, A. E. P. (2006). Learning of auditory equivalence classes for vowels by rats. Behavioural Processes, 73(3), 348–359. doi:10.1016/j.beproc.2006.08.005Find this resource:

Fairhall, A. L., Lewen, G. D., Bialek, W., & de Ruyter van Steveninck, R. R. (2001). Efficiency and ambiguity in an adaptive neural code. Nature, 412(23), 787–792. doi:10.1038/35090500Find this resource:

Fant, G. (1960). Acoustic theory of speech production. The Hague: Mouton.Find this resource:

Fine, A. B., Jaeger, T. F., Farmer, T. A., & Qian, T. (2013). Rapid expectation adaptation during syntactic comprehension. PLoS One, 8. doi:10.1371/journal.pone.0077661Find this resource:

Fitch, W. T., & Giedd, J. (1999). Morphology and development of the human vocal tract: A study using magnetic resonance imaging. Journal of the Acoustical Society of America, 106(3), 1511–1522. doi:10.1121/1.427148Find this resource:

Foulkes, P., & Hay, J. (2015). The emergence of sociophonetic structure. In B. MacWhinney & W. O’Grady (Eds.), The handbook of language emergence (pp. 292–313). Hoboken, NJ: John Wiley. doi:10.1002/9781118346136.ch13Find this resource:

Fowler, C. A. (1986). An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics, 14, 3–28.Find this resource:

Fowler, C. A. (1991). Auditory perception is not special: We see the world, we feel the world, we hear the world. Journal of the Acoustical Society of America, 89, 2910–2915. doi:10.1121/1.400729Find this resource:

Fowler, C. A., & Deckle, D. J. (1991). Listening with eye and hand: Cross-modal contributions to speech perception. Journal of Experimental Psychology: Human Perception and Performance, 17(3), 816–821. doi:10.1037/0096-1523.17.3.816Find this resource:

Fox, R. A., Flege, J. E., & Munro, M. J. (1995). The perception of English and Spanish vowels by native English and Spanish listeners: A multidimensional scaling analysis. Journal of the Acoustical Society of America, 97(4), 2540–2551. doi:10.1121/1.411974Find this resource:

Fox, R. A., & Qi, Y.-Y. (1990). Context effects in the perception of lexical tone. Journal of Chinese Linguistics, 18, 261–283.Find this resource:

Galantucci, B., Fowler, C. A., & Turvey, M. T. (2006). The motor theory of speech perception reviewed. Psychonomic Bulletin & Review, 13(3), 361–377. doi:10.3758/BF03193857Find this resource:

Gass, S., & Varonis, E. M. (1984). The effect of familiarity on the comprehensibility of nonnative speech. Language Learning, 34(1). doi:10.1111/j.1467-1770.1984.tb00996.xFind this resource:

Gibson, E. J. (1969). Principles of perceptual learning and development. New York: Appleton-Century-Crofts.Find this resource:

Gibson, J. J. (1966). The senses considered as perceptual systems. Boston: Houghton-Mifflin.Find this resource:

Gibson, J. J., & Gibson, E. J. (1955). Perceptual learning: Differentiation or enrichment?. Psychological Review, 105, 251–279. doi:10.1037/h0048826Find this resource:

Goldinger, S. D. (1996). Words and voices: Episodic traces in spoken word identification and recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22(5), 1166–1183. doi:10.1037/0278-7393.22.5.1166Find this resource:

Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Psychological Review, 105, 251–279. doi:10.1037/0033-295X.105.2.251Find this resource:

Goldinger, S. D. (2007). A complementary-systems approach to abstract and episodic speech perception. In Proceedings of the 16th International Congress of Phonetic Sciences (ICPhS) (pp. 49–54). Dudweiler: Pirrot GmbH.Find this resource:

Goldstone, R. L. (1998). Perceptual learning. Annual Review of Psychology, 49, 585–612. doi:10.1146/annurev.psych.49.1.585Find this resource:

Goslin, J., Duffy, H., & Floccia, C. (2012). An ERP investigation of foreign and regional accent processing. Brain and Language, 122(2), 92–102. doi:10.1016/j.bandl.2012.04.017Find this resource:

Greenspan, S. L., Nusbaum, H. C., & Pisoni, D. B. (1988). Perceptual learning of synthetic speech produced by rule. Journal of Experimental Psychology: Learning, Memory & Cognition, 14(3), 421–433. doi:10.1037/0278-7393.14.3.421Find this resource:

Greenwood, D. D. (1961). Auditory masking and the critical band. Journal of the Acoustical Society of America, 33(4), 484–502. doi:10.1121/1.1908699Find this resource:

Gutnisky, D. A., & Dragoi, V. (2008). Adaptive coding of visual information in neural populations. Nature, 452(13), 220–224. doi:10.1038/nature06563Find this resource:

Halberstam, B., & Raphael, L. J. (2004). Vowel normalization: The role of fundamental frequency and upper formants. Journal of Phonetics, 32, 423–434. doi:0.1016/j.wocn.2004.03.001Find this resource:

Halle, M., Hughes, G. W., & Radley, J.-P. A. (1957). Acoustic properties of stop consonants. Journal of the Acoustical Society of America, 29(1), 107–116. doi:10.1121/1.1908634Find this resource:

Hay, J., & Drager, K. (2010). Stuffed toys and speech perception. Linguistics, 48(41), 865–892. doi:10.1515/LING.2010.027Find this resource:

Hay, J., Nolan, A., & Drager, K. (2006). From fush to feesh: Exemplar priming in speech perception. The Linguistic Review, 23, 351–379. doi:10.1515/TLR.2006.014Find this resource:

Hay, J., Warren, P., & Drager, K. (2006). Factors influencing speech perception in the context of a merger-in-progress. Journal of Phonetics, 34, 458–484. doi:10.1016/j.wocn.2005.10.001Find this resource:

Heald, S. L. M., & Nusbaum, H. C. (2015). Variability in vowel production within and between days. PLOS One, 10(9). doi:10.1371/journal.pone.0136791Find this resource:

Hickok, G., Costanzo, M., Capasso, R., & Miceli, G. (2011). The role of Broca’s area in speech perception: Evidence from aphasia revisited. Brain and Language, 119(3), 214–220. doi:10.1016/j.bandl.2011.08.001Find this resource:

Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97(5), 3099–3111. doi:10.1121/1.411872Find this resource:

Hirahara, T., & Kato, H. (1992). The effect of F0 on vowel identification. In Y. Tohkura, E. Vatikiotis-Bateson, & R. Sagisaka (Eds.), Speech perception, production and linguistic structure (pp. 89–112). Tokyo: Ohmsha Publishing.Find this resource:

Hogden, J., Rubin, P., McDermott, E., Katagiri, S., & Goldstein, L. (2007). Inverting mappings from smooth paths through Rn to paths through Rm: A technique applied to recovering articulation from acoustics. Speech Communication, 49(5), 361–383. doi:10.1016/j.specom.2007.02.008Find this resource:

Holt, L. L. (2005). Temporally nonadjacent nonlinguistic sounds affect speech categorization. Psychological Science, 16(4), 305–312. doi:10.1111/j.0956-7976.2005.01532.xFind this resource:

Holt, L. L. (2006). Speech categorization in context: Joint effects of nonspeech and speech precursors. Journal of the Acoustical Society of America, 119(6), 4016–4026. doi:10.1121/1.2195119Find this resource:

Holt, L. L., & Lotto, A. J. (2002). Behavioral examinations of the level of auditory processing of speech context effects. Hearing Research, 167(1–2), 156–169. doi:10.1016/S0378-5955(02)00383-0Find this resource:

Huang, J., & Holt, L. L. (2009). General perceptual contributions to lexical tone normalization. Journal of the Acoustical Society of America, 125(6), 3983–3994. doi:10.1121/1.3125342Find this resource:

Huang, J., & Holt, L. L. (2011). Evidence for the central origin of lexical tone normalization. Journal of the Acoustical Society of America, 129(3), 1145–1148. doi:10.1121/1.3543994Find this resource:

Huber, J. E., Stathopoulos, E. T., Curione, G. M., Ash, T. A., and Johnson, K. (1999). Formants of children, women, and men: The effects of vocal intensity variation. Journal of the Acoustical Society of America, 106(3), 1532–1542. doi:10.1121/1.427150Find this resource:

Idemaru, K., & Holt, L. L. (2011). Word recognition reflects dimension-based statistical learning. Journal of Experimental Psychology: Human Perception and Performance, 37, 1939–1956. doi:10.1037/a0025641Find this resource:

Irino, T., & Patterson, R. D. (2002). Segregating information about the size and shape of the vocal tract using a time-domain auditory model: The stabilised wavelet-Mellin transform. Speech Communication, 36(3–4), 181–203. doi:10.1016/S0167-6393(00)00085-6Find this resource:

Iskarous, K., Fowler, C. A., & Whalen, D. H. (2010). Locus equations are an acoustic expression of articulator synergy. Journal of the Acoustical Society of America, 128(4), 2021–2032. doi:10.1121/1.3479538Find this resource:

Johnson, K. (1990). The role of perceived speaker identity in F0 normalization of vowels. Journal of the Acoustical Society of America, 88(2), 642–654. doi:10.1121/1.399767Find this resource:

Johnson, K. (1991). Differential effects of speaker and vowel variability on fricative perception. Language and Speech, 34, 265–279. doi:10.1177/002383099103400304Find this resource:

Johnson, K. (1997). Speech perception without speaker normalization. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 146–165). San Diego, CA: Academic Press.Find this resource:

Johnson, K. (2005). Speaker normalization in speech perception. In D. B. Pisoni & R. E. Remez (Eds.), The handbook of speech perception (pp. 363–389). Oxford: Blackwell.Find this resource:

Johnson, K. (2006). Resonance in an exemplar-based lexicon: The emergence of social identity and phonology. Journal of Phonetics, 34(4), 485–499. doi:10.1016/j.wocn.2005.08.004Find this resource:

Jongman, A., Wayland, R., & Wong, S. (2000). Acoustic characteristics of English fricatives. Journal of the Acoustical Society of America, 108, 1252–1263.Find this resource:

Joos, M. A. (1948). Acoustic phonetics. Language, 24, 1–136.Find this resource:

Katz, W. F., & Assmann, P. F. (2001). Identification of children’s and adults’ vowels: Intrinsic fundamental frequency, fundamental frequency dynamics, and presence of voicing. Journal of Phonetics, 29(1), 33–51. doi:10.006/jpho.2000.0135Find this resource:

Kewley-Port, D. (1983). Time-varying features as correlates of place of articulation in stop consonants. Journal of the Acoustical Society of America, 73, 1779–1793. doi:10.1121/1.388813Find this resource:

Klatt, D. H. (1986). The problem of variability in speech recognition and models of speech perception. In J. S. Perkell & D. H. Klatt (Eds.), Invariance and variability in speech processing (pp. 300–319). Hillsdale, NJ: Lawrence Erlbaum.Find this resource:

Klatt, D. H. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82(3), 737–793. doi:10.1121/1.395275Find this resource:

Kleinschmidt, D. F., & Jaeger, T. F. (2011). A Bayesian belief updating model of phonetic recalibration and selective adaptation. In Proceedings of the 2nd ACL Workshop on Cognitive Modeling and Computational Linguistics (pp. 10–19). Portland, OR: Omnipress, Inc.Find this resource:

Kleinschmidt, D. F., & Jaeger, T. F. (2012). A continuum of phonetic adaptation: Evaluating an incremental belief-updating model of recalibration and selective adaptation. In N. Miyake, D. Peebles, & R. P. Cooper (Eds.), Proceedings of the 34th Annual Conference of the Cognitive Science Society (pp. 599–604). Austin, TX: Cognitive Science Society.Find this resource:

Kleinschmidt, D. F., & Jaeger, T. F. (2015a). Re-examining selective adaptation: Fatiguing feature detectors, or distributional learning?. Psychonomic Bulletin & Review, 23(3), 678–691. doi:10.3758/s13423-015-0943-zFind this resource:

Kleinschmidt, D. F., & Jaeger, T. F. (2015b). Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychological Review, 122(2), 148–203. doi:10.1037/a0038695Find this resource:

Kleinschmidt, D. F., Raizada, R., & Jaeger, T. F. (2015). Supervised and unsupervised learning in phonetic adaptation. In D. C. Noelle, R. Dale, A. S. Warlaumont, J. Yoshimi, T. Matlock, C. D. Jennings, & P. P. Maglio (Eds.), Proceedings of the 37th Annual Meeting of the Cognitive Science Society (pp. 1129–1134). Austin, TX: Cognitive Science Society.Find this resource:

Koops, C., Gentry, E., & Pantos, A. (2008). The effect of perceived speaker age on the perception of PIN and PEN vowels in Houston, Texas. University of Pennsylvania Working Papers in Linguistics: Selected Papers from NWAV 36, 12, 91–101.Find this resource:

Kraljic, T., Brennan, S. E., & Samuel, A. G. (2008). Accommodating variation: Dialects, idiolects, and speech processing. Cognition, 107, 54–81. doi:10.1016/j.cognition.2007.07.013Find this resource:

Kraljic, T., & Samuel, A. G. (2005). Perceptual learning for speech: Is there a return to normal?. Cognitive Psychology, 51, 141–178. doi:10.1016/j.cogpsych.2005.05.001Find this resource:

Kraljic, T., & Samuel, A. G. (2006). Generalization in perceptual learning for speech. Psychonomic Bulletin and Review, 13, 262–268. doi:10.3758/BF03193841Find this resource:

Kraljic, T., & Samuel, A. G. (2007). Perceptual adjustments to multiple speakers. Journal of Memory and Language, 56, 1–15.Find this resource:

Kriengwatana, B., Escudero, P., & ten Cate, C. (2015). Revisiting vocal perception in non-human animals: A review of vowel discrimination, speaker voice recognition, and speaker normalization. Frontiers in Psychology, 15(1543), 1–13. doi:10.3389/fpsyg.2014.01543Find this resource:

Kuhl, P. K., & Miller, J. D. (1975). Speech perception by the chinchilla: Voiced-voiceless distinction in alveolar plosive consonants. Science, 190(4209), 69–72. doi:10.1126/science.1166301Find this resource:

Labov, W. (1966). The social stratification of English in New York City. Washington, DC: Center for Applied Linguistics.Find this resource:

Labov, W., Ash, S., & Boberg, C. (2006). The atlas of North American English. Berlin: Mouton de Gruyter.Find this resource:

Ladefoged, P. (1980). What are linguistic sounds made of?. Language, 56(3), 485–502. doi:10.2307/414446Find this resource:

Ladefoged, P. (1989). A note on “information conveyed by vowels”. Journal of the Acoustical Society of America, 85(5), 2223–2224. doi:10.1121/1.397821Find this resource:

Ladefoged, P., & Broadbent, D. E. (1957). Information conveyed by vowels. Journal of the Acoustical Society of America, 29, 98–104.Find this resource:

Lahiri, A., Gewirth, L., & Blumstein, S. E. (1984). A reconsideration of acoustic invariance for place of articulation in diffuse stop consonants: Evidence from a cross-language study. Journal of the Acoustical Society of America, 76(2), 391–404. doi:10.1121/1.391580Find this resource:

Laing, E. J. C., Liu, R., Lotto, A. J., & Holt, L. L. (2012). Tuned with a tune: Talker normalization via general auditory processes. Frontiers in Psychology, 3(203). doi:10.3389/fpsyg.2012.00203Find this resource:

Lee, S., Potamianos, A., & Narayanan, S. S. (1999). Acoustics of children’s speech: Developmental changes of temporal and spectral parameters. Journal of the Acoustical Society of America, 105(3), 1455–1468. doi:10.1121/1.426686Find this resource:

Liberman, A. M. (1982). On finding that speech is special. American Psychologist, 37, 148–167. doi:10.1037/0003-066X.37.2.148Find this resource:

Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74(6), 431–461. doi:10.1037/h0020279Find this resource:

Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21(1), 1–36. doi:10.1016/0010-0277(85)90021-6Find this resource:

Lindblom, B. (1990). Explaining phonetic variation: A sketch of the H&H theory. In W. J. Hardcastle & A. Marchal (Eds.), Speech production and speech modelling (pp. 403–439). Dordrecht, The Netherlands: Kluwer Academic Publishers. doi:10.1007/978-94-009-2037-8_16Find this resource:

Lindblom, B., & Studdert-Kennedy, M. (1967). On the role of formant transitions in vowel recognition. Journal of the Acoustical Society of America, 42, 830–843. doi:10.1121/1.1910655Find this resource:

Liu, R., & Holt, L. L. (2015). Dimension-based statistical learning of vowels. Journal of Experimental Psychology: Human Perception and Performance, 41(6), 1783–1798. doi:10.1037/xhp0000092Find this resource:

Lively, S. E., Logan, J. S., & Pisoni, D. B. (1993). Training Japanese listeners to identify English/r/and/l/: II. The role of phonetic environment and talker variability in learning new perceptual categories. Journal of the Acoustical Society of America, 94, 1242–1255. doi:10.1121/1.408177Find this resource:

Lloyd, R. J. (1890a). Some researches into the nature of vowel-sound. Liverpool, U.K.: Turner and Dunnett.Find this resource:

Lloyd, R. J. (1890b). Speech sounds: Their nature and causation (I). Phonetische Studien, 3, 251–278.Find this resource:

Lloyd, R. J. (1891). Speech sounds: Their nature and causation (II–IV). Phonetische Studien, 4, 37–67, 183–214, 275–306.Find this resource:

Lloyd, R. J. (1892). Speech sounds: Their nature and causation (V–VII). Phonetische Studien, 5, 1–32, 129–141, 263–271.Find this resource:

Lobanov, B. (1971). Classification of Russian vowels spoken by different speakers. Journal of the Acoustical Society of America, 49, 606–608. doi:10.1121/1.1912396Find this resource:

Logan, J. S., Lively, S. E., & Pisoni, D. B. (1991). Training Japanese listeners to identify English/r/and/l/: A first report. Journal of the Acoustical Society of America, 89(2), 874–876. doi:10.1121/1.1894649Find this resource:

Lotto, A., Hickok, G., & Holt, L. L. (2009). Reflections on mirror neurons and speech perception. Trends in Cognitive Science, 13(3), 110–114. doi:10.1016/j.tics.2008.11.008Find this resource:

Mann, V. A. (1980). Influence of preceding liquid on stop-consonant perception. Perception & Psychophysics, 28(5), 407–412. doi:10.3758/BF03204884Find this resource:

Mann, V. A., & Repp, B. H. (1980). Influence of vocalic context on perception of the [s]*−[ʃ]* distinction. Perception & Psychophysics, 23(3), 213–228. doi:10.3758/BF03204377Find this resource:

Marin, S., Pouplier, M., & Harrington, J. (2010). Acoustic consequences of articulatory variability during the productions of/t/and/k/and its implications for speech error research. Journal of the Acoustical Society of America, 127(1), 445–461. doi:10.1121/1.3268600Find this resource:

Maye, J., Aslin, R., & Tanenhaus, M. (2008). The Weckud Wetch of the Wast: Lexical adaptation to a novel accent. Cognitive Science, 32, 543–562. doi:10.1080/03640210802035357Find this resource:

Maye, J., Werker, J. F., & Gerken, L. (2002). Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82(3), B101–B111. doi:10.1016/S0010-0277(01)00157-3Find this resource:

McGowan, R. S., & Berger, M. A. (2009). Acoustic-articulatory mapping in vowels by locally weighted regression. Journal of the Acoustical Society of America, 126(4), 2011–2032. doi:10.1121/1.3184581Find this resource:

McGowan, R. S., & Cushing, S. (1999). Vocal tract normalization for mid-sagittal articulatory recovery with analysis-by-synthesis. Journal of the Acoustical Society of America, 106(2), 1090–1105. doi:10.1121/1.427117Find this resource:

McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748. doi:10.1038/264746a0Find this resource:

McMurray, B., & Jongman, A. (2011). What information is necessary for speech categorization? Harnessing variability in the speech signal by integrating cues computed relative to expectations. Psychological Review, 118(2), 219–246. doi:10.1037/a0022325Find this resource:

McQueen, J. M., Cutler, A., & Norris, D. (2006). Phonological abstraction in the mental lexicon. Cognitive Science, 30, 1113–1126. doi:10.1207/s15516709cog0000_79Find this resource:

Miller, J. D. (1989). Auditory-perceptual interpretation of the vowel. Journal of the Acoustical Society of America, 85, 2114–2134. doi:10.1121/1.2029825Find this resource:

Mitra, V., Nam, H., Epsy-Wilson, C., Saltzman, E., & Goldstein, L. (2012). Recognizing articulatory gestures from speech for robust speech recognition. Journal of the Acoustical Society of America, 131(3), 2270–2287. doi:10.1121/1.3682038Find this resource:

Mitterer, H., Chen, Y., & Zhou, X. (2011). Phonological abstraction in processing lexical-tone variation: Evidence from a learning paradigm. Cognitive Science, 35(1), 184–197. doi:10.1111/j.1551-6709.2010.01140.xFind this resource:

Mitterer, H., Scharenborg, O., & McQueen, J. M. (2013). Phonological abstraction without phonemes in speech perception. Cognition, 129(2), 356–361. doi:10.1016/j.cognition.2013.07.011Find this resource:

Monahan, P. J., & Idsardi, W. J. (2010). Auditory sensitivity to formant ratios: Toward an account of vowel normalization. Language and Cognitive Processes, 25(6), 808–839. doi:10.1080/01690965.2010.490047Find this resource:

Moore, C. B., & Jongman, A. (1997). Speaker normalization in the perception of Mandarin Chinese tones. Journal of the Acoustical Society of America, 102(3), 1864–1877. doi:10.1121/1.420092Find this resource:

Myers, E. B., Blumstein, S. E., Walsh, E., & Eliassen, J. (2009). Inferior frontal regions underlie the perception of phonetic category invariance. Psychological Science, 20(7), 895–903. doi:10.1111/j.1467-9280.2009.02380.xFind this resource:

Naeser, M. A., Palumbo, C., Helm-Estabrooks, N., Stiassny-Eder, D., & Albert, M. L. (1989). Severe nonfluency in aphasia: Role of the medical subcallosal fasciculus and other white matter pathways in recovery of spontaneous speech. Brain, 112(1), 1–38. doi:10.1093/brain/112.1.1Find this resource:

Nearey, T. M. (1989). Static, dynamic, and relational properties in vowel perception. Journal of the Acoustical Society of America, 85(5), 2088–2113. doi:10.1121/1.397861Find this resource:

Newman, R. S., & Sawusch, J. S. (1996). Perceptual normalization for speaking rate: Effects of temporal distance. Perception & Psychophysics, 58(4), 540–560. doi:10.3758/BF03213089Find this resource:

Niedzielski, N. (1999). The effect of social information on the perception of sociolinguistic variables. Journal of Language and Social Psychology, 18(1), 62–85. doi:10.1177/0261927X99018001005Find this resource:

Nordström, P.-E., & Lindblom, B. (1975). A normalization procedure for vowel formant data. In Proceedings of the 8th International Congress of Phonetic Sciences (p. 212).Find this resource:

Norris, D. (2006). The Bayesian reader: Explaining word recognition as an optimal Bayesian decision process. Psychological Review, 113(2), 327–357.Find this resource:

Norris, D., & McQueen, J. M. (2008). Shortlist B: A Bayesian model of continuous speech recognition. Psychological Review, 115(2), 357–395. doi:10.1037/0033-295X.115.2.357Find this resource:

Norris, D., McQueen, J. M., & Cutler, A. (2000). Merging information in speech recognition: Feedback is never necessary. Behavioral and Brain Sciences, 23, 299–325.Find this resource:

Norris, D., McQueen, J. M., & Cutler, A. (2003). Perceptual learning in speech. Cognitive Psychology, 47(2), 204–238. doi:10.1016/S0010-0285(03)00006-9Find this resource:

Nusbaum, H., & Magnuson, J. (1997). Talker normalization: Phonetic constancy as a cognitive process. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 109–132). New York: Academic Press.Find this resource:

Nygaard, L. C., & Pisoni, D. B. (1998). Talker-specific learning in speech perception. Perception & Psychophysics, 60(3), 355–376. doi:10.3758/BF03206860Find this resource:

Nygaard, L. C., Sommers, M. C., & Pisoni, D. B. (1994). Speech perception as a talker-contingent process. Psychological Science, 5(1), 42–46. doi:10.1111/j.1467-9280.1994.tb00612.xFind this resource:

Ohman, S. E. G. (1966). Coarticulation in VCV utterances: Spectrographic measurements. The Journal of the Acoustical Society of America, 39(1), 151–168. doi:10.1121/1.1909864Find this resource:

Ohms, V. R., Gill, A., Van Heijningen, C. A. A., Beckers, G. J. L., & ten Cate, C. (2010). Zebra finches exhibit speaker-independent phonetic perception of human speech. Proceedings of the Royal Society B, 277, 1003–1009. doi:10.1098/rspb.2009.1788Find this resource:

Palmeri, T. J., Goldinger, S. D., & Pisoni, D. B. (1993). Episodic encoding of voice attributes and recognition memory for spoken words. Journal of Experimental Psychology: Learning, Memory and Cognition, 19(2), 309–328. doi:10.1037//0278-7393.19.2.309Find this resource:

Perry, T. L., Ohde, R. N., & Ashmead, D. H. (2001). The acoustic bases for gender identification from children’s voices. Journal of the Acoustical Society of America, 109(6), 2988–2998. doi:10.1121/1.1370525Find this resource:

Peterson, G. E. (1952). The information-bearing elements of speech. Journal of the Acoustical Society of America, 24, 629–637. doi:10.1121/1.1906945Find this resource:

Peterson, G. E. (1961). Parameters of vowel quality. Journal of Speech and Hearing Research, 4(1), 10–29. doi:10.1044/jshr.0401.10Find this resource:

Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels. Journal of the Acoustical Society of America, 24, 175–184. doi:10.1121/1.1917300Find this resource:

Pierrehumbert, J. B. (2001). Exemplar dynamics: Word frequency, lenition, and contrast. In J. Bybee & P. Hooper (Eds.), Frequency effects and emergent grammar (pp. 137–157). Amsterdam: John Benjamins.Find this resource:

Pierrehumbert, J. B. (2002). Word-specific phonetics. In C. Gussenhoven & N. Warner (Eds.), Laboratory phonology vii (pp. 101–139). Berlin: Mouton de Gruyter.Find this resource:

Pisoni, D. B. (1997). Some thoughts on “normalization” in speech perception. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 9–33). San Diego, CA: Academic Press.Find this resource:

Pisoni, D. B., & Levi, S. V. (2007). Some observations on representations and representational specificity in speech perception and spoken word recognition. In G. Gaskell (Ed.), The Oxford handbook of psycholinguistics (pp. 3–18). New York: Oxford University Press.Find this resource:

Potter, R. K., & Steinberg, J. C. (1950). Toward the specification of speech. Journal of the Acoustical Society of America, 22, 807–820. doi:10.1121/1.1906694Find this resource:

Preston, D. R. (1989). Perceptual dialectology: Nonlinguists’ views of areal linguistics. Providence, RI: Foris.Find this resource:

Protopapas, A., & Lieberman, P. (1997). Fundamental frequency of phonation and perceived emotional stress. Journal of the Acoustical Society of America, 101(4), 2267–2277. doi:10.1121/1.418247Find this resource:

Reinisch, E., & Holt, L. L. (2014). Lexically-guided phonetic retuning of foreign-accented speech and its generalization. Journal of Experimental Psychology: Human Perception and Performance, 40(2), 539–555. doi:10.1037/a0034409Find this resource:

Romero-Rivas, C., Martin, C. D., & Costa, A. (2015). Processing changes when listening to foreign-accented speech. Frontiers in Human Neuroscience, 9, 167. doi:10.3389/fnhum.2015.00167Find this resource:

Samuel, A. G., & Kraljic, T. (2009). Perceptual learning for speech. Attention, Perception, & Psychophysics, 71(6), 1207–1218. doi:10.3758/APP.71.6.1207Find this resource:

Sato, M., Cavé, C., Ménard, L., & Brasseur, A. (2010). Auditory-tactile speech perception in congenitally blind and sighted adults. Neuropsychologia, 48(12), 3683–3686. doi:10.1016/j.neuropsychologia.2010.08.017Find this resource:

Schacter, D. L., & Church, B. A. (1992). Auditory priming: Implicit and explicit memory for words and voices. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18(5), 915–930. doi:10.1037/0278-7393.18.5.915Find this resource:

Schertz, J., Cho, T., Lotto, A., & Warner, N. (2015). Individual differences in phonetic cue use in production and perception of a non-native contrast. Journal of Phonetics, 52, 183–204.Find this resource:

Schroeter, J., & Sondhi, M. M. (1994). Techniques for estimating vocal-tract shapes from the speech signal. IEEE Transactions on Speech and Audio Processing, 2(1), 133–150. doi:10.1109/89.260356Find this resource:

Searle, C. L., Jacobson, J. Z., & Rayment, S. G. (1979). Stop consonant discrimination based on human audition. Journal of the Acoustical Society of America, 65(3), 799–809. doi:10.1121/1.382501Find this resource:

Shankweiler, D., Strange, W., & Verbrugge, R. (1975). Speech and the problem of perceptual constancy. In R. Shaw & J. Bransford (Eds.), Perceiving, acting, and knowing: Toward an ecological psychology (pp. 315–345). Hillsdale, NJ: Erlbaum.Find this resource:

Sharpee, T. O., Sugihara, H., Kurgansky, A. V., Rebrik, S. P., Stryker, M. P., & Miller, K. D. (2006). Adaptive filtering enhances information transmission in visual cortex. Nature, 439(23), 936–942. doi:10.1038/nature04519Find this resource:

Sidaras, S., Alexander, J. E. D., & Nygaard, L. C. (2009). Perceptual learning of systematic variation in Spanish-accented speech. Journal of the Acoustical Society of America, 125(5), 3306–3316. doi:10.1121/1.3101452Find this resource:

Sjerps, M. J., & McQueen, J. M. (2010). The bounds of flexibility in speech perception. Journal of Experimental Psychology: Human Perception and Performance, 36(1), 195–211. doi:10.1037/a0016803Find this resource:

Sjerps, M. J., McQueen, J. M., & Mitterer, H. (2013). Evidence for precategorical extrinsic vowel normalization. Attention, Perception, & Psychophysics, 75, 576–587. doi:10.3758/s13414-012-0408-7Find this resource:

Smith, B. L., & Hayes-Harb, R. (2011). Individual differences in the production and perception of the voicing contrast by native and non-native speakers of English. Journal of Phonetics, 39, 115–120. doi:10.1016/j.wocn.2010.11.005Find this resource:

Sonderegger, M., & Yu, A. C. L. (2010). A rational account of perceptual compensation for coarticulation. In S. Ohlsson & R. Catrambone (Eds.), Proceedings of the 32nd Annual Conference of the Cognitive Science Society (pp. 375–380). Austin, TX: Cognitive Science Society.Find this resource:

Staum Casasanto, L. (2008). Does social information influence sentence processing? In Proceedings of the 30th Annual Meeting of the Cognitive Science Society (pp. 799–804). Washington, DC: Cognitive Science Society.Find this resource:

Stevens, K. N. (1972). The quantal nature of speech: Evidence from articulatory-acoustic data. In E. E. David & P. B. Denes (Eds.), Human communication: A unified view (pp. 51–66). New York: McGraw-Hill.Find this resource:

Stevens, K. N., & Blumstein, S. E. (1977). Onset spectra as cues for consonantal place of articulation. Journal of the Acoustical Society of America, 61, S48. doi:10.1121/1.2015732Find this resource:

Stevens, K. N., & Blumstein, S. E. (1978). Invariant cues for place of articulation in stop consonants. Journal of the Acoustical Society of America, 64(5), 1358–1368. doi:10.1121/1.382102Find this resource:

Stevens, K. N., & Blumstein, S. E. (1981). The search for invariant acoustic correlates of phonetic features. In P. Eimas & J. L. Miller (Eds.), Perspectives on the study of speech (pp. 1–38). Hillsdale, NJ: Erlbaum.Find this resource:

Stocker, A. A., & Simoncelli, E. P. (2006). Sensory adaptation within a Bayesian framework for perception. In Y. Weiss, B. Scholkopf, & J. Platt (Eds.), Advances in neural information processing systems (Vol. 18, pp. 1291–1298). Cambridge, MA: MIT Press.Find this resource:

Story, B. H., & Bunton, K. (2010). Relation of vocal tract shape, formant transitions, and stop consonant identifiation. Journal of Speech, Language and Hearing Research, 53(6), 1514–1528. doi:10.1044/1092-4388(2010/09-0127Find this resource:

Strange, W. (1989). Dynamic specification of coarticulated vowels spoken in sentence context. Journal of the Acoustical Society of America, 85, 2135–2153. doi:10.1121/1.397863Find this resource:

Sumner, M., Kim, S. K., King, E., & McGowan, K. (2014). The socially weighted encoding of spoken words: A dual-route approach to speech perception. Frontiers in Psychology, 4. doi:10.3389/fpsyg.2013.01015Find this resource:

Sussman, H. M. (1986). A neuronal model of vowel normalization and representation. Brain and Language, 28(1), 12–23. doi:10.1016/0093-934X(86)90087-8Find this resource:

Sussman, H. M. (1989). Neural coding of relational invariance in speech: Human language analogs to the barn owl. Psychological Review, 96(4), 631–642. doi:10.1037/0033-295X.96.4.631Find this resource:

Sussman, H. M., Fruchter, D., Hilbert, J., & Sirosh, J. (1998). Linear correlates in the speech signal: The orderly output constraint. Behavioral and Brain Sciences, 21(2), 241–299. doi:10.1017/S0140525X98001174Find this resource:

Syrdal, A., & Gopal, H. (1986). A perceptual model of vowel recognition based on the auditory representation of American English vowels. Journal of the Acoustical Society of America, 79, 1086–1100. doi:10.1121/1.393381Find this resource:

Tartter, V. C. (1991). Identifiability of vowels and speakers from whispered syllables. Perception & Psychophysics, 49(4), 365–372. doi:10.3758/BF03205994Find this resource:

Theodore, R. M., & Miller, J. L. (2010). Characteristics of listener sensitivity to talker-specific phonetic detail. Journal of the Acoustical Society of America, 128(4), 2090–2099. doi:10.1121/1.3467771Find this resource:

Tosacano, J. C., & McMurray, B. (2010). Cue integration with categories: Weighting acoustic cues in speech using unsupervised learning and distributional statistics. Cognitive Science, 34(3), 434–464. doi:10.1111/j.1551-6709.2009.01077.xFind this resource:

Trude, A. M., & Brown-Schmidt, S. (2012). Talker-specific perceptual adaptation during online speech perception. Language and Cognitive Processes, 27(7–8), 979–1001. doi:10.1080/01690965.2011.597153Find this resource:

Verbrugge, R. R., Strange, W., Shankweiler, D. P., & Edman, T. R. (1976). What information enables a listener to map a talker’s vowel space?. Journal of the Acoustical Society of America, 60(1), 198–212. doi:10.1121/1.1919793Find this resource:

Vroomen, J., & Baart, M. (2012). Phonetic recalibration in audiovisual speech. In M. M. Murray & M. T. Wallace (Eds.), The neural bases of multisensory processes (pp. 363–380). Boca Raton, FL: CRC Press.Find this resource:

Walker, A., & Hay, J. (2011). Congruence between “word age” and “voice age” facilitates lexical access. Laboratory Phonology, 2(1), 219–237 doi:10.1515/labphon.2011.007Find this resource:

Walley, A. C., & Carrell, T. D. (1983). Onset spectra and formant transitions in the adult’s and child’s perception of place of articulation in stop consonants. Journal of the Acoustical Society of America, 73(3), 1011–1022. doi:10.1121/1.389149

Watkins, A. J., & Makin, S. J. (1996). Effects of spectral contrast on perceptual compensation for spectral-envelope distortion. Journal of the Acoustical Society of America, 99(6), 3749–3757. doi:10.1121/1.414981

Weatherholtz, K. (2015). Perceptual learning of systemic cross-category vowel variation (Unpublished doctoral dissertation). The Ohio State University, Columbus, OH.

Wilson, S. M. (2009). Speech perception when the motor system is compromised. Trends in Cognitive Sciences, 13(8), 329–330. doi:10.1016/j.tics.2009.06.001

Yang, J., & Fox, R. A. (2014). Perception of English vowels by bilingual Chinese-English and corresponding monolingual listeners. Language and Speech, 57(2), 215–237. doi:10.1177/0023830913502774

Zahorian, S. A., & Jagharghi, A. J. (1993). Spectral-shape features versus formants as acoustic correlates for vowels. Journal of the Acoustical Society of America, 94(4), 1966–1982. doi:10.1121/1.407520

Zue, V. W. (1976). Acoustic characteristics of stop consonants: A controlled study (Unpublished doctoral dissertation). MIT, Cambridge, MA.

Notes:

(1.) As we discuss below, biological differences are not sufficient to explain cross-linguistic variation in the formant structure of male versus female talkers (Johnson, 2006).

(2.) The general ratio-based proposal traces to the work of Richard John Lloyd in the late 1800s (Lloyd, 1890a, b, 1891, 1892). To quote from Lloyd’s doctoral thesis: “the great implied postulate of the organic system of phonetics . . . [is that] like articulations produce like sounds . . . For if half-a-dozen human beings, of identical type but widely differing size, all articulate a given vowel exactly in a given way, it is then clear that mathematically speaking, these six examples of the configuration of that vowel will be a series of similar figures. Now if this be true, whether vowel resonance be single or double or even more complex, it is certain that the pitch of that resonance or body of resonances will vary exactly in proportion to the relative size of the configuration from which it proceeds” (Lloyd, 1890a, p. 172, emphasis added). Here, Lloyd outlines the earliest account of F0-based normalization, in which vowel formants are scaled by the talker’s fundamental frequency (F0), which is the acoustic correlate of perceived pitch.
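Lloyd’s ratio hypothesis can be made concrete with a small numerical sketch. The Python snippet below uses purely illustrative formant values (not measurements from any study), chosen so that one talker’s frequencies are a uniform rescaling of the other’s; under the hypothesis, expressing each formant as a ratio to F0 then yields identical, talker-invariant values.

```python
# Hypothetical formant values (Hz) for the "same" vowel from two talkers
# whose resonating cavities differ only in overall size. The numbers are
# illustrative, chosen so all frequencies scale by a common factor (~1.83).
talker_a = {"F0": 120.0, "F1": 500.0, "F2": 1500.0}
talker_b = {"F0": 220.0, "F1": 916.7, "F2": 2750.0}

def f0_ratio_normalize(talker):
    """Express each formant as a ratio to the talker's fundamental
    frequency (F0), per Lloyd's ratio hypothesis: if every resonance
    scales with configuration size, Fn/F0 is talker-invariant."""
    f0 = talker["F0"]
    return {name: round(hz / f0, 2) for name, hz in talker.items() if name != "F0"}

print(f0_ratio_normalize(talker_a))  # {'F1': 4.17, 'F2': 12.5}
print(f0_ratio_normalize(talker_b))  # {'F1': 4.17, 'F2': 12.5}
```

In practice, of course, vocal tracts do not rescale uniformly across talkers (for example, the male pharyngeal cavity lengthens disproportionately at puberty), which is one reason purely ratio-based schemes fall short; see also note 1.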

(3.) The terms perceptual learning and adaptation have been variously defined in speech perception research (and more generally in the field of cognitive psychology). Sometimes perceptual learning refers to long-lasting changes in how the perceptual system processes incoming stimulus information, while adaptation is taken to refer to relatively short-term adjustments (see Goldstone, 1998) based on bottom-up information (Eisner, 2012; but see Kleinschmidt & Jaeger, 2015a). In yet other cases, adaptation is considered the behavioral outcome of any type of learning mechanism that tracks and responds to properties of the environment (see Kleinschmidt & Jaeger, 2015b; see also Bradlow & Bent, 2008; Fine, Jaeger, Farmer, & Qian, 2013; Maye, Aslin, & Tanenhaus, 2008).

(4.) The ideal adapter extends previous ideal observer accounts of speech perception, which assumed that category-to-cue mappings are identical across talkers (e.g., Norris, 2006; Norris & McQueen, 2008; Norris, McQueen, & Cutler, 2000; Sonderegger & Yu, 2010).
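As a concrete illustration of the computation these models share, the sketch below categorizes a single acoustic cue via Bayes’ rule with Gaussian likelihoods. The cue dimension (voice onset time), the category parameters, and the function names are all illustrative assumptions, not taken from the cited work. An ideal observer in the sense above applies one parameter set to every talker, whereas an ideal adapter maintains talker-specific sets (here simply supplied as separate dictionaries, with the learning of those parameters omitted).

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def categorize(cue, categories, prior=None):
    """Posterior probability of each category given one cue value,
    via Bayes' rule: p(category | cue) is proportional to
    p(cue | category) * p(category)."""
    prior = prior or {c: 1.0 / len(categories) for c in categories}
    joint = {c: gaussian_pdf(cue, *params) * prior[c]
             for c, params in categories.items()}
    total = sum(joint.values())
    return {c: p / total for c, p in joint.items()}

# Illustrative voice onset time (VOT, in ms) distributions, as (mean, sd),
# for a /b/-/p/ contrast. An ideal observer applies the same parameters to
# every talker; an ideal adapter learns and applies talker-specific ones.
generic_talker = {"/b/": (0.0, 15.0), "/p/": (50.0, 15.0)}
long_vot_talker = {"/b/": (10.0, 15.0), "/p/": (70.0, 15.0)}

print(categorize(25.0, generic_talker))   # exactly ambiguous: 0.5 / 0.5
print(categorize(25.0, long_vot_talker))  # same cue clearly favors /b/
```

On this framing, the ideal adapter’s contribution is a principled account of how the talker-specific parameter sets are inferred and updated from exposure; the sketch above deliberately omits that learning step.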
