1-8 of 8 Results

  • Keywords: speech perception x
Clear all


The Motor Theory of Speech Perception  

D. H. Whalen

The Motor Theory of Speech Perception is a proposed explanation of the fundamental relationship between the way speech is produced and the way it is perceived. Associated primarily with the work of Liberman and colleagues, it posited the active participation of the motor system in the perception of speech. Early versions of the theory contained elements that later proved untenable, such as the expectation that the neural commands to the muscles (as seen in electromyography) would be more invariant than the acoustics. Support drawn from categorical perception (in which discrimination is quite poor within linguistic categories but excellent across boundaries) was called into question by studies showing means of improving within-category discrimination and finding similar results for nonspeech sounds and for animals perceiving speech. Evidence for motor involvement in perceptual processes nonetheless continued to accrue, and related motor theories have been proposed. Neurological and neuroimaging results have yielded a great deal of evidence consistent with variants of the theory, but they highlight the issue that there is no single “motor system,” and so different components appear in different contexts. Assigning the appropriate amount of effort to the various systems that interact to result in the perception of speech is an ongoing process, but it is clear that some of the systems will reflect the motor control of speech.


Dispersion Theory and Phonology  

Edward Flemming

Dispersion Theory concerns the constraints that govern contrasts, the phonetic differences that can distinguish words in a language. Specifically it posits that there are distinctiveness constraints that favor contrasts that are more perceptually distinct over less distinct contrasts. The preference for distinct contrasts is hypothesized to follow from a preference to minimize perceptual confusion: In order to recover what a speaker is saying, a listener must identify the words in the utterance. The more confusable words are, the more likely a listener is to make errors. Because contrasts are the minimal permissible differences between words in a language, banning indistinct contrasts reduces the likelihood of misperception. The term ‘dispersion’ refers to the separation of sounds in perceptual space that results from maximizing the perceptual distinctiveness of the contrasts between those sounds, and is adopted from Lindblom’s Theory of Adaptive Dispersion, a theory of phoneme inventories according to which inventories are selected so as to maximize the perceptual differences between phonemes. These proposals follow a long tradition of explaining cross-linguistic tendencies in the phonetic and phonological form of languages in terms of a preference for perceptually distinct contrasts. Flemming proposes that distinctiveness constraints constitute one class of constraints in an Optimality Theoretic model of phonology. In this context, distinctiveness constraints predict several basic phenomena, the first of which is the preference for maximal dispersion in inventories of contrasting sounds that first motivated the development of the Theory of Adaptive Dispersion. But distinctiveness constraints are formulated as constraints on the surface forms of possible words that interact with other phonological constraints, so they evaluate the distinctiveness of contrasts in context. As a result, Dispersion Theory predicts that contrasts can be neutralized or enhanced in particular phonological contexts. This prediction arises because the phonetic realization of sounds depends on their context, so the perceptual differences between contrasting sounds also depend on context. If the realization of a contrast in a particular context would be insufficiently distinct (i.e., it would violate a high-ranked distinctiveness constraint), there are two options: the offending contrast can be neutralized, or it can be modified (‘enhanced’) to make it more distinct. A basic open question regarding Dispersion Theory concerns the proper formulation of distinctiveness constraints and the extent of variation in their rankings across languages, issues that are tied up with the questions about the nature of perceptual distinctiveness. Another concerns the size and nature of the comparison set of contrasting word-forms required to be able to evaluate whether a candidate output satisfies distinctiveness constraints.


Cochlear Implants  

Matthew B. Winn and Peggy B. Nelson

Cochlear implants (CIs) are the most successful sensory implant in history, restoring the sensation of sound to thousands of persons who have severe to profound hearing loss. Implants do not recreate acoustic sound as most of us know it, but they instead convey a rough representation of the temporal envelope of signals. This sparse signal, derived from the envelopes of narrowband frequency filters, is sufficient for enabling speech understanding in quiet environments for those who lose hearing as adults and is enough for most children to develop spoken language skills. The variability between users is huge, however, and is only partially understood. CIs provide acoustic information that is sufficient for the recognition of some aspects of spoken language, especially information that can be conveyed by temporal patterns, such as syllable timing, consonant voicing, and manner of articulation. They are insufficient for conveying pitch cues and separating speech from noise. There is a great need for improving our understanding of functional outcomes of CI success beyond measuring percent correct for word and sentence recognitions. Moreover, greater understanding of the variability experienced by children, especially children and families from various social and cultural backgrounds, is of paramount importance. Future developments will no doubt expand the use of this remarkable device.


Audiovisual Speech Perception and the McGurk Effect  

Lawrence D. Rosenblum

Research on visual and audiovisual speech information has profoundly influenced the fields of psycholinguistics, perception psychology, and cognitive neuroscience. Visual speech findings have provided some of most the important human demonstrations of our new conception of the perceptual brain as being supremely multimodal. This “multisensory revolution” has seen a tremendous growth in research on how the senses integrate, cross-facilitate, and share their experience with one another. The ubiquity and apparent automaticity of multisensory speech has led many theorists to propose that the speech brain is agnostic with regard to sense modality: it might not know or care from which modality speech information comes. Instead, the speech function may act to extract supramodal informational patterns that are common in form across energy streams. Alternatively, other theorists have argued that any common information existent across the modalities is minimal and rudimentary, so that multisensory perception largely depends on the observer’s associative experience between the streams. From this perspective, the auditory stream is typically considered primary for the speech brain, with visual speech simply appended to its processing. If the utility of multisensory speech is a consequence of a supramodal informational coherence, then cross-sensory “integration” may be primarily a consequence of the informational input itself. If true, then one would expect to see evidence for integration occurring early in the perceptual process, as well in a largely complete and automatic/impenetrable manner. Alternatively, if multisensory speech perception is based on associative experience between the modal streams, then no constraints on how completely or automatically the senses integrate are dictated. There is behavioral and neurophysiological research supporting both perspectives. Much of this research is based on testing the well-known McGurk effect, in which audiovisual speech information is thought to integrate to the extent that visual information can affect what listeners report hearing. However, there is now good reason to believe that the McGurk effect is not a valid test of multisensory integration. For example, there are clear cases in which responses indicate that the effect fails, while other measures suggest that integration is actually occurring. By mistakenly conflating the McGurk effect with speech integration itself, interpretations of the completeness and automaticity of multisensory may be incorrect. Future research should use more sensitive behavioral and neurophysiological measures of cross-modal influence to examine these issues.


Second Language Phonetics  

Ocke-Schwen Bohn

The study of second language phonetics is concerned with three broad and overlapping research areas: the characteristics of second language speech production and perception, the consequences of perceiving and producing nonnative speech sounds with a foreign accent, and the causes and factors that shape second language phonetics. Second language learners and bilinguals typically produce and perceive the sounds of a nonnative language in ways that are different from native speakers. These deviations from native norms can be attributed largely, but not exclusively, to the phonetic system of the native language. Non-nativelike speech perception and production may have both social consequences (e.g., stereotyping) and linguistic–communicative consequences (e.g., reduced intelligibility). Research on second language phonetics over the past ca. 30 years has resulted in a fairly good understanding of causes of nonnative speech production and perception, and these insights have to a large extent been driven by tests of the predictions of models of second language speech learning and of cross-language speech perception. It is generally accepted that the characteristics of second language speech are predominantly due to how second language learners map the sounds of the nonnative to the native language. This mapping cannot be entirely predicted from theoretical or acoustic comparisons of the sound systems of the languages involved, but has to be determined empirically through tests of perceptual assimilation. The most influential learner factors which shape how a second language is perceived and produced are the age of learning and the amount and quality of exposure to the second language. A very important and far-reaching finding from research on second language phonetics is that age effects are not due to neurological maturation which could result in the attrition of phonetic learning ability, but to the way phonetic categories develop as a function of experience with surrounding sound systems.


Direct Perception of Speech  

Carol A. Fowler

The theory of speech perception as direct derives from a general direct-realist account of perception. A realist stance on perception is that perceiving enables occupants of an ecological niche to know its component layouts, objects, animals, and events. “Direct” perception means that perceivers are in unmediated contact with their niche (mediated neither by internally generated representations of the environment nor by inferences made on the basis of fragmentary input to the perceptual systems). Direct perception is possible because energy arrays that have been causally structured by niche components and that are available to perceivers specify (i.e., stand in 1:1 relation to) components of the niche. Typically, perception is multi-modal; that is, perception of the environment depends on specifying information present in, or even spanning, multiple energy arrays. Applied to speech perception, the theory begins with the observation that speech perception involves the same perceptual systems that, in a direct-realist theory, enable direct perception of the environment. Most notably, the auditory system supports speech perception, but also the visual system, and sometimes other perceptual systems. Perception of language forms (consonants, vowels, word forms) can be direct if the forms lawfully cause specifying patterning in the energy arrays available to perceivers. In Articulatory Phonology, the primitive language forms (constituting consonants and vowels) are linguistically significant gestures of the vocal tract, which cause patterning in air and on the face. Descriptions are provided of informational patterning in acoustic and other energy arrays. Evidence is next reviewed that speech perceivers make use of acoustic and cross modal information about the phonetic gestures constituting consonants and vowels to perceive the gestures. Significant problems arise for the viability of a theory of direct perception of speech. One is the “inverse problem,” the difficulty of recovering vocal tract shapes or actions from acoustic input. Two other problems arise because speakers coarticulate when they speak. That is, they temporally overlap production of serially nearby consonants and vowels so that there are no discrete segments in the acoustic signal corresponding to the discrete consonants and vowels that talkers intend to convey (the “segmentation problem”), and there is massive context-sensitivity in acoustic (and optical and other modalities) patterning (the “invariance problem”). The present article suggests solutions to these problems. The article also reviews signatures of a direct mode of speech perception, including that perceivers use cross-modal speech information when it is available and exhibit various indications of perception-production linkages, such as rapid imitation and a disposition to converge in dialect with interlocutors. An underdeveloped domain within the theory concerns the very important role of longer- and shorter-term learning in speech perception. Infants develop language-specific modes of attention to acoustic speech signals (and optical information for speech), and adult listeners attune to novel dialects or foreign accents. Moreover, listeners make use of lexical knowledge and statistical properties of the language in speech perception. Some progress has been made in incorporating infant learning into a theory of direct perception of speech, but much less progress has been made in the other areas.


Speech Perception and Generalization Across Talkers and Accents  

Kodi Weatherholtz and T. Florian Jaeger

The seeming ease with which we usually understand each other belies the complexity of the processes that underlie speech perception. One of the biggest computational challenges is that different talkers realize the same speech categories (e.g., /p/) in physically different ways. We review the mixture of processes that enable robust speech understanding across talkers despite this lack of invariance. These processes range from automatic pre-speech adjustments of the distribution of energy over acoustic frequencies (normalization) to implicit statistical learning of talker-specific properties (adaptation, perceptual recalibration) to the generalization of these patterns across groups of talkers (e.g., gender differences).



D. H. Whalen

Phonetics is the branch of linguistics that deals with the physical realization of meaningful distinctions in spoken language. Phoneticians study the anatomy and physics of sound generation, acoustic properties of the sounds of the world’s languages, the features of the signal that listeners use to perceive the message, and the brain mechanisms involved in both production and perception. Therefore, phonetics connects most directly to phonology and psycholinguistics, but it also engages a range of disciplines that are not unique to linguistics, including acoustics, physiology, biomechanics, hearing, evolution, and many others. Early theorists assumed that phonetic implementation of phonological features was universal, but it has become clear that languages differ in their phonetic spaces for phonological elements, with systematic differences in acoustics and articulation. Such language-specific details place phonetics solidly in the domain of linguistics; any complete description of a language must include its specific phonetic realization patterns. The description of what phonetic realizations are possible in human language continues to expand as more languages are described; many of the under-documented languages are endangered, lending urgency to the phonetic study of the world’s languages. Phonetic analysis can consist of transcription, acoustic analysis, measurement of speech articulators, and perceptual tests, with recent advances in brain imaging adding detail at the level of neural control and processing. Because of its dual nature as a component of a linguistic system and a set of actions in the physical world, phonetics has connections to many other branches of linguistics, including not only phonology but syntax, semantics, sociolinguistics, and clinical linguistics as well. Speech perception has been shown to integrate information from both vision and tactile sensation, indicating an embodied system. Sign language, though primarily visual, has adopted the term “phonetics” to represent the realization component, highlighting the linguistic nature both of phonetics and of sign language. Such diversity offers many avenues for studying phonetics, but it presents challenges to forming a comprehensive account of any language’s phonetic system.