A computational learner needs three things: data to learn from, a class of representations to acquire, and a way to get from one to the other. Language acquisition is a very particular learning setting that can be defined in terms of the input (the child’s early linguistic experience) and the output (a grammar capable of generating a language very similar to the input). The input is infamously impoverished. As it relates to morphology, the vast majority of potential forms are never attested in the input, and those that are attested follow an extremely skewed frequency distribution. Learners nevertheless manage to acquire most details of their native morphologies after only a few years of input. That said, acquisition is neither instantaneous nor error-free. Children do make mistakes, and they do so in predictable ways that provide insights into their grammars and learning processes. From a linguist’s perspective, the most elucidating computational model of morphology learning is one that learns morphology as a child does, that is, on child-like input and along a child-like developmental path. This article focuses on clarifying those aspects of morphology acquisition that should go into such an elucidating computational model. Section 1 describes the input with a focus on child-directed speech corpora and input sparsity. Section 2 discusses representations with a focus on productivity, developmental paths, and formal learnability. Section 3 surveys the range of learning tasks that guide research in computational linguistics and NLP with special focus on how they relate to the acquisition setting. The conclusion in Section 4 presents a summary of morphology acquisition as a learning problem with Table 4 highlighting the key takeaways of this article.
Yu-Ying Chuang and R. Harald Baayen
Naive discriminative learning (NDL) and linear discriminative learning (LDL) are simple computational algorithms for lexical learning and lexical processing. Both NDL and LDL assume that learning is discriminative, driven by prediction error, and that it is this error that calibrates the association strength between input and output representations. Both words’ forms and their meanings are represented by numeric vectors, and mappings between forms and meanings are set up. For comprehension, form vectors predict meaning vectors. For production, meaning vectors map onto form vectors. These mappings can be learned incrementally, approximating how children learn the words of their language. Alternatively, optimal mappings representing the end state of learning can be estimated. The NDL and LDL algorithms are incorporated in a computational theory of the mental lexicon, the ‘discriminative lexicon’. The model shows good performance both with respect to production and comprehension accuracy, and for predicting aspects of lexical processing, including morphological processing, across a wide range of experiments. Since, mathematically, NDL and LDL implement multivariate multiple regression, the ‘discriminative lexicon’ provides a cognitively motivated statistical modeling approach to lexical processing.
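The incremental, error-driven mapping from form vectors to meaning vectors described above can be illustrated with a minimal sketch. It assumes the Widrow-Hoff (delta) rule as the update, which is one common way to implement prediction-error learning for a linear mapping; the abstract itself does not specify the rule. The cue vectors, meaning vectors, learning rate, and function names are all invented for illustration.

```python
# Toy illustration only: the lexicon, learning rate, and function names
# are hypothetical and not taken from the article.

def predict(cues, weights):
    """Comprehension step: map a form (cue) vector onto a meaning vector."""
    n_out = len(weights[0])
    return [sum(c * weights[i][j] for i, c in enumerate(cues))
            for j in range(n_out)]

def delta_update(cues, target, weights, rate=0.1):
    """One learning event under the (assumed) Widrow-Hoff delta rule.

    The weights change in proportion to the prediction error, so a form
    that already predicts its meaning well causes almost no change.
    """
    predicted = predict(cues, weights)
    error = [t - p for t, p in zip(target, predicted)]
    for i, c in enumerate(cues):
        for j, e in enumerate(error):
            weights[i][j] += rate * c * e
    return error

# Hypothetical lexicon: three binary form cues -> two-dimensional meanings.
lexicon = [
    ([1, 0, 1], [1.0, 0.0]),
    ([0, 1, 1], [0.0, 1.0]),
]

# Start from an empty (all-zero) mapping and present the words repeatedly.
weights = [[0.0, 0.0] for _ in range(3)]
for _ in range(200):
    for cues, meaning in lexicon:
        delta_update(cues, meaning, weights)

predictions = [predict(cues, weights) for cues, _ in lexicon]
```

With enough learning events, the incrementally learned weights approach a mapping that reproduces the target meanings for this toy lexicon, which corresponds to the "end state of learning" that can alternatively be estimated in one step by least-squares regression.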
Phonotactics is the study of restrictions on possible sound sequences in a language. In any language, some phonotactic constraints can be stated without reference to morphology, but many of the more nuanced phonotactic generalizations do make use of morphosyntactic and lexical information. At the most basic level, many languages mark edges of words in some phonological way. Different phonotactic constraints hold of sounds that belong to the same morpheme as opposed to sounds that are separated by a morpheme boundary. Different phonotactic constraints may apply to morphemes of different types (such as roots versus affixes). There are also correlations between the phonotactic shape of a morpheme and whether it follows certain morphosyntactic and phonological rules, which may in turn correlate with syntactic category, declension class, or etymological origin. Approaches to the interaction between phonotactics and morphology address two questions: (1) how to account for rules that are sensitive to morpheme boundaries and structure and (2) how to determine the status of phonotactic constraints associated with only some morphemes. Theories differ as to how much morphological information phonology is allowed to access. In some theories of phonology, any reference to the specific identities or subclasses of morphemes would exclude a rule from the domain of phonology proper. These rules are either part of the morphology or are not given the status of a rule at all. Other theories allow the phonological grammar to refer to detailed morphological and lexical information. Depending on the theory, phonotactic differences between morphemes may receive direct explanations or be seen as the residue of historical change and not something that constitutes grammatical knowledge in the speaker’s mind.
Basilio Calderone and Vito Pirrelli
Nowadays, computer models of human language are instrumental to millions of people, who use them every day with little if any awareness of their existence and role. Their exponential development has had a huge impact on daily life through practical applications like machine translation or automated dialogue systems. It has also deeply affected the way we think about language as an object of scientific inquiry. Computer modeling of Romance languages has helped scholars develop new theoretical frameworks and new ways of looking at traditional approaches. In particular, computer modeling of lexical phenomena has had a profound influence on some fundamental issues in human language processing, such as the purported dichotomy between rules and exceptions, or grammar and lexicon, the inherently probabilistic nature of speakers’ perception of analogy and word internal structure, and their ability to generalize to novel items from attested evidence. Although it is probably premature to anticipate and assess the prospects of these models, their current impact on language research can hardly be overestimated. In a few years, data-driven assessment of theoretical models is expected to play an irreplaceable role in pacing progress in all branches of language sciences, from typological and pragmatic approaches to cognitive and formal ones.
Daniel Schmidtke and Victor Kuperman
Lexical representations in an individual mind are not open to direct scrutiny. Thus, in theorizing about mental representations, researchers must rely on observable and measurable outcomes of language processing, that is, perception, production, storage, access, and retrieval of lexical information. Morphological research pursues these questions utilizing the full arsenal of analytical tools and experimental techniques that are at the disposal of psycholinguistics. This article outlines the most popular approaches, and aims to provide, for each technique, a brief overview of its procedure in experimental practice. Additionally, the article describes the link between the processing effect(s) that the tool can elicit and the representational phenomena that it may shed light on. The article discusses methods of morphological research in the two major human linguistic faculties, production and comprehension, and provides a separate treatment of spoken, written, and sign language.
Claudia Marzi and Vito Pirrelli
Over the past decades, psycholinguistic aspects of word processing have made a considerable impact on views of language theory and language architecture. In the quest for the principles governing the ways human speakers perceive, store, access, and produce words, inflection issues have provided a challenging realm of scientific inquiry, and a battlefield for radically opposing views. It is somewhat ironic that some of the most influential cognitive models of inflection have long been based on evidence from an inflectionally impoverished language like English, where the notions of inflectional regularity, (de)composability, predictability, phonological complexity, and default productivity appear to be mutually implied. An analysis of more “complex” inflection systems such as those of Romance languages shows that this mutual implication is not a universal property of inflection, but a contingency of poorly contrastive, nearly isolating inflection systems. Far from presenting minor faults in a solid theoretical edifice, Romance evidence appears to call into question the division of labor between rules and exceptions, the on-line processing vs. long-term memory dichotomy, and the distinction between morphological processes and lexical representations. A dynamic, learning-based view of inflection is more compatible with these data, whereby morphological structure is an emergent property of the ways inflected forms are processed and stored, grounded in universal principles of lexical self-organization and their neuro-functional correlates.
Corpora are an all-important resource in linguistics, as they constitute the primary source for large-scale examples of language usage. This has been even more evident in recent years, with the increasing availability of texts in digital format leading more and more corpus linguistics toward a “big data” approach. As a consequence, the quantitative methods adopted in the field are becoming more sophisticated and varied. When it comes to morphology, corpora represent a primary source of evidence to describe morpheme usage, and in particular how often a particular morphological pattern is attested in a given language. There is hence a tight relation between corpus linguistics and the study of morphology and the lexicon. This relation, however, can be considered bi-directional. On the one hand, corpora are used as a source of evidence to develop metrics and train computational models of morphology: by means of corpus data it is possible to quantitatively characterize morphological notions such as productivity, and corpus data are fed to computational models to capture morphological phenomena at different levels of description. On the other hand, morphology has also been applied as an organizing principle to corpora. Annotations of linguistic data often adopt morphological notions as guidelines. The resulting information, whether obtained from human annotators or from automatic systems, makes corpora easier to analyze and more convenient to use in a number of applications.
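As a rough illustration of how corpus data can quantify productivity, the sketch below computes one widely used hapax-based measure: the number of word types with a given affix that occur exactly once, divided by the total number of tokens with that affix. The abstract does not name a specific metric, so the choice of this measure is an assumption, and the token list and function name are invented for the example.

```python
from collections import Counter

def potential_productivity(tokens):
    """Hapax-based productivity: hapax legomena / total tokens.

    `tokens` is assumed to hold every corpus token formed with the affix
    under study; extracting those tokens is outside this sketch.
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hapaxes = sum(1 for freq in counts.values() if freq == 1)
    return hapaxes / len(tokens)

# Hypothetical sample of "-ness" tokens drawn from a corpus.
ness_tokens = [
    "happiness", "happiness", "happiness",
    "darkness", "darkness",
    "sadness", "kindness", "awareness",
]

score = potential_productivity(ness_tokens)  # 3 hapaxes / 8 tokens
```

A higher score suggests that new formations with the affix are encountered relatively often as the sample grows. Scores are only comparable across affixes when they are computed on samples of similar size.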