- Daniel Aalto, Daniel AaltoDepartment of Communication Sciences and Disorders, Faculty of Rehabilitation Medicine, University of Alberta; Institute for Reconstructive Sciences in Medicine, Misericordia Community Hospital, Edmonton, Canada
- Jarmo MalinenJarmo MalinenDepartment of Mathematics and Systems Analysis, Aalto University
- and Martti VainioMartti VainioDepartment of Digital Humanities, University of Helsinki
Formant frequencies are the positions of the local maxima of the power spectral envelope of a sound signal. They arise from acoustic resonances of the vocal tract air column, and they provide substantial information about both consonants and vowels. In running speech, formants are crucial in signaling the movements with respect to place of articulation. Formants are normally defined as accumulations of acoustic energy estimated from the spectral envelope of a signal. However, not all such peaks can be related to resonances in the vocal tract, as they can be caused by the acoustic properties of the environment outside the vocal tract, and sometimes resonances are not seen in the spectrum. Such formants are called spurious and latent, respectively. By analogy, spectral maxima of synthesized speech are called formants, although they arise from a digital filter. Conversely, speech processing algorithms can detect formants in natural or synthetic speech by modeling its power spectral envelope using a digital filter. Such detection is most successful for male speech with a low fundamental frequency where many harmonic overtones excite each of the vocal tract resonances that lie at higher frequencies. For the same reason, reliable formant detection from females with high pitch or children’s speech is inherently difficult, and many algorithms fail to faithfully detect the formants corresponding to the lowest vocal tract resonant frequencies.