Printed from Oxford Research Encyclopedias, Communication. Under the terms of the licence agreement, an individual user may print out a single article for personal use (for details see Privacy Policy and Legal Notice).

date: 08 December 2023

Cognitive and Interactive Mechanisms for Mutual Understanding in Conversation

  • Ashley Micklos, Utrecht University
  • Marieke Woensdregt, University of Edinburgh


Everyday conversation is, as the term suggests, a frequent and seemingly effortless phenomenon. However, when closely examined, it is seen that the process of achieving mutual understanding in conversation involves both complex social reasoning and finely tuned interactive mechanisms. Referential communication provides an excellent case study for what makes everyday language interactions complex: people recruit an intricate web of cognitive capacities and interactive resources in order to get their message across. In terms of cognitive capacities, reaching mutual understanding in conversation involves social reasoning in order to establish common ground and take into account one’s conversational partner when producing and interpreting utterances. Specifically, people continuously adapt to their conversational partner by keeping track of what information is or is not shared (based on the situational context, preceding discourse, and general knowledge) and adjusting their utterances and interpretations accordingly. In terms of interactive resources, mechanisms that allow us to keep a conversation on track (e.g., backchannels) and the mechanisms that allow us to recover from breakdowns in communication (i.e., repair) contribute to mutual understanding. Specifically, other-initiated repair, a conversational phenomenon that has been documented cross-linguistically and observed in experimental settings, is an interactional resource for (re)establishing intersubjectivity between interlocutors. The historic separation between cognitive capacities on the one hand and interactive resources on the other hand has created an artificial divide, when in fact both mechanisms interact with, and even presuppose, one another. This article puts forward a unified perspective on the cognitive and interactive mechanisms for mutual understanding, moving towards better understanding of the complementary roles of these mechanisms in interaction.


  • Interpersonal Communication
  • Language and Social Interaction

Introduction


In everyday interactions with others, we are able to manage a range of communicative actions and intentions. We can slyly suggest someone should help carry out a task, adeptly argue in a debate, and hold extended discussions on complicated subjects. These extraordinary communicative abilities are underpinned by mechanisms that allow us to recover from the breakdowns that inevitably arise as we interact with others. Consider the following excerpt of dialogue between two people who live together, share the same social circle, and have spent almost a year in constant contact with one another during the Covid-19 pandemic:

Jaime: Bill got his first vaccination today.

Andy: Really? How’s that possible, what is New Jersey doing?

Jaime: No, not that Bill. Neighbor Bill.

Andy: Oh, that makes more sense.

Though the talk itself is not complex, miscommunication arises when Andy is unable to initially recognize the intended referent—the retirement-aged next-door neighbor, Bill—and steers the conversation into a direction that was not intended—about a mutual friend named Bill who lives in New Jersey and is not in the vaccination age group at the time. While the talk is indeed ordinary, there is a great deal that comes into play in these everyday interactions that make them complex. First, factors relating to the individuals involved, and their experiences, can vary. For instance, each individual brings different experiences to the conversation; in the example, Jaime had just been outside where they encountered neighbor Bill. Andy, on the other hand, had remained inside and was not aware of the neighborly interaction. Therefore Jaime, but not Andy, had an experience that narrowed potential referents. Moreover, each individual may draw from certain contextual cues from which to construct a particular referent. For example, Andy saw that Jaime was holding a phone during this news telling, and the phone (i.e., texting) is how they communicate with New Jersey Bill; Andy may have assumed the news just came in via text, and merited telling. Second, factors related to the referent can result in complexity. In this example, the referent is displaced from the conversation—this is a discussion about a person who is not immediately present. In addition, the referent has the potential for ambiguity; there are (apparently) at least two Bills, and the generic reference “Bill” does not help to clarify which of these two possible referents is intended. Finally, the role of perspective-taking (i.e., making inferences about what the other has in mind) also adds complexity. 
Although both Bills can be considered mutual knowledge between Andy and Jaime (what is known as common ground), Jaime might wrongly assume neighbor Bill to be the one most salient to Andy when thinking of Bills they both know, whereas Andy might first think of New Jersey Bill when looking for Bills that Jaime could assume to be shared and salient (e.g., because Andy knows that Jaime had a video call with New Jersey Bill a couple of days ago). This example, and certainly the countless others that arise in ordinary, daily conversation, reveals the complexities that arise in human language interactions.

Communication problems of this kind are not unusual; in fact, misunderstandings are quite prevalent in reference, particularly when the referent is displaced (Bazzanella & Damiano, 1999; Drummond & Hopper, 1991). Therefore, to facilitate recognition, there are preferences for doing reference in particular ways (Sacks & Schegloff, 1979), as well as for highlighting potentially troublesome referents. Try-marking, for instance, is the producer’s1 marking of a referent as potentially difficult to recognize by overlaying a rising intonational contour (as in questions), leaving a brief pause following the referential expression (Moerman, 1988), maintaining eye contact, and/or holding a sign for an extended duration (Byun, Vos, et al., 2018), seemingly to allow space for the recipient to acknowledge recognition (Clark & Bangerter, 2004; Clark & Wilkes-Gibbs, 1986; Sacks & Schegloff, 1979). Try-marking makes visible two broad sets of processes that underpin human communicative interaction: (a) cognitive and (b) interactive. Regarding cognitive processes, try-marking indicates what the producer is thinking with respect to the retrievability of their referent: that it may be difficult for the receiver to recognize. Regarding interactive processes, try-marking invites the recipient to participate in the construction of the referential expression: they can confirm recognition or further question the referent. Most notably, try-marking reveals that the goal of referential communication is mutual understanding. Referential communication is a particularly interesting context for investigating complex language interactions because it is in these contexts that mutual understanding becomes relevant and actionable.2

Assuming language use is cooperative in nature (Grice, 1975; Levinson, 2006), the goal of referential communication is to reach mutual understanding. Given this goal, and the fact that neither interlocutor can look into the other’s mind to directly observe what was intended or understood, referential communication requires negotiation. This involves converging not just on a lexical description of the intended referent but also on a shared conceptualization (Brennan & Clark, 1996). This negotiation process can take on an explicit form, where one person proposes a description and receives explicit assent from their interlocutor. But it can also take on an implicit form, where one person produces a given lexical description (with inherent in it a particular conceptualization) which is later reused by their interlocutor, and thereby implicitly agreed upon, a process known as interactive alignment (Fusaroli et al., 2012; Pickering & Garrod, 2004; Rasenberg et al., 2020). It is through implicit as well as explicit negotiation, sometimes by aligning on linguistic forms, sometimes by making conversational moves that are complementary to one another (Fusaroli & Tylén, 2012), that conversation partners make sure they are, and stay, on the same page. While the convergence on lexical referents, called linguistic alignment, can occur through these processes, mutual understanding of referents does not necessarily require it.

Clark and Bangerter (2004, p. 35) assert that “[e]stablishing common ground is a central problem in acts of referring, and people have evolved strategies for doing that.” In other words, the fact that conversation partners have the aim to reach mutual understanding makes referential communication challenging. However, the goal of reaching mutual understanding may at the same time hold the key to what makes doing reference possible: a constant collaborative effort toward this shared goal is what streamlines our complex language interactions. The aim of this article is to outline the cognitive and interactive processes involved: the “strategies” that promote mutual understanding in referential communication. For the sake of clarity, this article will start by discussing each of these types of processes separately. “Cognitive Mechanisms” discusses the cognitive processes involved in referential communication, with a focus on grounding and perspective-taking (based on the situational context, preceding discourse, and general world knowledge). In other words: those cognitive processes through which conversation partners simulate each other’s minds and thereby establish and make use of common ground (see “Common Ground” and “Perspective-Taking”). “Interactive Mechanisms” discusses the interactive mechanisms that streamline language use, including both strategies that are employed to keep the conversation on track, namely backchannels (see “Backchannels”), and strategies that are employed to recover from breakdowns in communication, specifically other-initiated repair (see “Observational Accounts of OIR” and “Experimental Accounts of OIR”). Note, however, that this division into separate sections creates an artificial divide between the cognitive and interactive mechanisms involved in achieving mutual understanding; in reality, these two types of processes are intimately intertwined. 
“The Complementary Nature of Cognitive and Interactive Mechanisms” discusses how they interact and work together toward achieving mutual understanding. This article will end by considering both the theoretical implications of the intertwined nature of these processes as well as the broader social and cultural evolutionary dynamics in which they are embedded (see “Summary and Future Directions”).

Cognitive Mechanisms

Natural languages are rife with ambiguity (Wasow et al., 2005), and, to add insult to injury, our utterances come with the problem of underdeterminacy (Carston, 2002): any given utterance can have an infinite set of potential intended meanings. If someone says “I’m tired,” this can, depending on the context, mean anything from “Let’s not prolong this virtual meeting” to “Let’s go to the pub” to “I’m thinking of quitting my job.” Our ability to take into account the context when producing and interpreting utterances has thus been argued to play a crucial role in making successful communication possible (Clark, 1996; Levinson, 1983; Sperber & Wilson, 1995). Our conversational partner can be viewed as part of that context, and indeed as a particularly potent contextual cue because they enrich the context with their own perspective, knowledge, beliefs, and goals, which can serve as a strongly constraining factor when determining how to convey a message or interpret an utterance (Brown-Schmidt et al., 2015). However, taking into account one’s conversational partner is also arguably the most challenging aspect of taking into account the context because it involves reasoning about something that is unobservable: one’s conversational partner’s mental states (although see Lavelle, 2012; Michael & De Bruin, 2015; Overgaard, 2017 for a critical discussion of this “unobservability thesis”). This section will review both empirical and theoretical evidence for the role that such reasoning about one’s interlocutor’s mind plays in achieving mutual understanding in referential communication.

Common Ground

The fact that the context of a conversation consists of not only the physical situation and the preceding dialogue but also the conversational partners themselves poses a challenge: How do people achieve mutual understanding if they both enter the conversation with their own perspective, beliefs, and knowledge? A classic account is that conversation partners gradually build up a representation of joint knowledge, or common ground, which they use to guide their utterances and interpretations (Clark, 1996, Chapter 4, 2006; Clark & Brennan, 1991). Working from the assumption that another person’s mental states are not directly observable, Clark and Marshall (1981) argue that a representation of common ground is formed using several sources of observable information: (a) physical copresence (i.e., things the conversation partners have experienced together); (b) linguistic copresence (i.e., things the conversation partners have previously talked about); and (c) community comembership (i.e., world knowledge that is likely to be shared based on being part of the same community). In addition to establishing a representation of common ground based on these information sources, conversation partners also continuously update their common ground through a process known as grounding, in which they try to establish whether their utterances have come across as intended (Clark & Brennan, 1991). This process of grounding involves a rich array of interactive mechanisms, which will be discussed in more detail in “Interactive Mechanisms” (the example of try-marking given in “Introduction” is a form of referential grounding).

If conversation partners jointly assume that they both make use of their common ground when producing and interpreting utterances, this can greatly enhance the efficiency and effectiveness of referential communication. For example (assuming that communication is cooperative; Grice, 1975; Levinson, 2006), if a speaker refers to “the house,” that utterance comes with the implication that she3 assumes (given her representation of common ground) that it is specific enough for the receiver to come to a unique interpretation (Rubio-Fernández, 2021). The receiver presumably knows about many houses, the knowledge of which is presumably shared with different people. But the commitment implicit in the producer’s utterance can guide the receiver in his interpretation process: only by assuming that the producer has reason to believe that her utterance provides enough information to come to a unique representation based on their common ground can the receiver infer which house the producer is referring to (Brown-Schmidt et al., 2015; Sperber & Wilson, 1995). As this example illustrates, successful use of common ground requires perspective-taking. “Perspective-Taking” discusses how such perspective-taking is used in both production and comprehension. Note, however, that common ground is a narrower concept than perspective-taking: whereas perspective-taking can be done unilaterally, the (successful) establishment of common ground requires conversation partners to jointly keep track of what is mutually known.

Perspective-Taking


If a producer aims to get her communicative intention across, it is useful to take her interlocutor’s perspective into account. That is, by considering how a given utterance would likely be interpreted by the receiver (for instance, based on what the producer thinks the receiver knows or doesn’t know), the producer can tailor her utterance to be as effective as possible. This practice of targeting one’s utterance to a particular recipient is known as recipient design (Sacks et al., 1974) or audience design (Clark & Murphy, 1982). The other way around, if the recipient’s aim is to infer the producer’s communicative intention, he should in turn take the producer’s perspective into account. Considering the producer’s perspective and assuming she will have chosen her utterance with the receiver in mind (e.g., based on their common ground) provides the receiver with a powerful tool to narrow down the set of possible meanings of the utterance and home in on the producer’s most likely intention (Levinson, 2000; Sperber & Wilson, 1995). This section will review empirical studies that have investigated to what extent and under what circumstances interlocutors take into account each other’s perspectives in referential communication, considering each of the three sources of common ground (physical copresence, linguistic copresence, and shared knowledge) in turn.

Perspective-Taking Based on Physical Copresence

The extent to which producers and receivers take each other’s perspective into account based on physical copresence (i.e., things the interlocutors have experienced together) has been productively investigated using an experimental paradigm developed by Keysar et al. (2000). In this experiment, two participants view the same array of objects placed in a grid of squares (resembling cubbyholes), but some objects are occluded from one participant’s point of view (as illustrated in Figure 1). This paradigm allows the experimenters to create situations where the visual context for one participant (two differently sized glasses, one of which is in the participant’s “privileged ground”) is incongruent with the partner-specific context (a single glass in common ground). In the production version of this task, the participant (the “director”) is asked to provide instructions to a receiver (the “matcher”) for how to rearrange the objects in the grid to match a certain configuration. In the comprehension version of the task, the producer is a confederate who gives scripted instructions, and the participant is asked to take on the role of the matcher while being eye-tracked.

Figure 1. Illustration of perspective-taking experimental paradigm.

Source: Developed by Keysar et al. (2000).

Participants are placed opposite each other, looking at the same array of objects arranged in cubbyholes. However, one of these cubbyholes is occluded from Participant B’s point of view, such that Participant A can see both glasses whereas Participant B can only see the large glass. In other words, the large glass, the crayon, and the tennis ball are in common ground, but the small glass is only in Participant A’s privileged ground.

In the production task, the variable of interest is whether the director (who has a view with something in privileged ground, similar to Participant A in Figure 1) produces overspecified utterances: utterances that give more information than what is strictly necessary for the receiver to uniquely identify the intended referent (such as saying the large glass rather than just the glass if the recipient has the view of Participant B in Figure 1). The extra information given in such an overspecified utterance is redundant (since the receiver can only see one glass), and therefore less efficient (Grice, 1975).4 Moreover, there is some evidence that overspecified utterances are potentially confusing to the receiver (Engelhardt et al., 2006; Gann & Barr, 2014). Studies using this experimental paradigm have shown that in a condition where the receiver can only see one of the objects (as Participant B in Figure 1), producers leave out the redundant adjective about 50% of the time. In contrast, in a condition where the receiver can see both objects, producers include the adjective in their utterances between 70% and 100% of the time (depending on the complexity of the visual scene and other task factors; for reviews, see Brown-Schmidt et al., 2015; Ryskin et al., 2015). Taken together, these results indicate that producers tend to take the receiver’s perspective into account. The producer’s communicative goal may also play a role in modulating this effect. Yoon et al. (2012) showed that producers are more likely to take their receiver’s perspective into account when requesting the receiver to perform an action with the target object (i.e., moving it) compared to when they are simply informing the receiver of an action that is being performed with the target object by the experimenter. 
This suggests that the more crucial it is for the producer’s communicative goal that the receiver reach a unique interpretation, the more likely the producer is to take into account the receiver’s perspective.
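
The production-side logic of this paradigm can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the attribute names, the object encoding, and the `shortest_description` helper are hypothetical and are not taken from Keysar et al. (2000) or any of the studies cited above. It contrasts a director who describes the target relative to everything she can see with one who describes it relative to common ground only.

```python
from itertools import combinations

# Hypothetical encoding of the objects in Figure 1 (attribute names are illustrative).
LARGE_GLASS = {"type": "glass", "size": "large"}
SMALL_GLASS = {"type": "glass", "size": "small"}
CRAYON = {"type": "crayon", "size": "small"}
TENNIS_BALL = {"type": "ball", "size": "small"}

director_view = [LARGE_GLASS, SMALL_GLASS, CRAYON, TENNIS_BALL]  # Participant A
matcher_view = [LARGE_GLASS, CRAYON, TENNIS_BALL]                # Participant B

# Common ground: objects visible to both; privileged ground: director only.
common_ground = [o for o in director_view if o in matcher_view]
privileged_ground = [o for o in director_view if o not in matcher_view]


def shortest_description(target, context, modifiers=("size",)):
    """Return the shortest phrase that picks out `target` uniquely in `context`."""
    for n in range(len(modifiers) + 1):
        for combo in combinations(modifiers, n):
            wanted = {"type": target["type"], **{m: target[m] for m in combo}}
            referents = [o for o in context
                         if all(o[k] == v for k, v in wanted.items())]
            if referents == [target]:
                return " ".join([target[m] for m in combo] + [target["type"]])
    return None  # no unique description with the available modifiers


# Describing relative to the director's own view vs. relative to common ground.
egocentric = shortest_description(LARGE_GLASS, director_view)
audience_designed = shortest_description(LARGE_GLASS, common_ground)
print(egocentric)         # large glass
print(audience_designed)  # glass
```

In this toy setting, the description computed from the director's own view includes the adjective that is redundant for the matcher, whereas the description computed from common ground omits it. Real producers, as the figures above indicate, fall somewhere between these two idealized strategies.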

In the reception version of this task, the variable of interest is whether the matcher (who now has a perspective with something in privileged ground, similar to Participant A in Figure 1) looks at objects that are only in her privileged ground when presented with an utterance that is underspecified from her egocentric perspective (i.e., when she doesn’t take the producer’s perspective into account). For example, a critical trial in the reception task may consist of a case where the receiver can see three differently sized glasses, with the smallest one being in privileged ground, and the confederate director asks her to move the small glass. When the eye-tracking measures reveal that a participant looked at an object in her privileged ground before picking up another object and moving it according to the director’s instructions, this is taken as evidence that she (briefly) considered the object in privileged ground as a potential referent of the director’s utterance. The assumption here is that because the objects that are in the matcher’s privileged ground are occluded from the director’s point of view, they cannot possibly be the director’s intended referent. Therefore, a participant who behaves fully allocentrically (i.e., always taking the producer’s perspective into account) should never look at the objects in her own privileged ground. Studies using this comprehension version of the task have yielded mixed results. Early studies found that participants generally failed to interpret utterances exclusively from the producer’s perspective (Keysar et al., 2000, 2003; Lin et al., 2010). Keysar et al. (2000, 2003), and Lin et al. 
(2010) argued that this shows that perspective-taking in comprehension is not an automatic and effortless process, but that instead people are “reflexively mindblind”: interpreting utterances from their egocentric perspective by default and only overcoming this egocentric bias through effortful perspective-taking when the egocentric interpretation causes ambiguity.

However, later variations on this experiment showed that participants do rapidly restrict the domain of interpretation to only those objects that are in common ground (Brown-Schmidt, 2012; Hanna & Tanenhaus, 2004; Hanna et al., 2003; Heller et al., 2008). Furthermore, recent work by Hawkins and Goodman (2016), Hawkins et al. (2021), and Rubio-Fernández (2017) has called into question the assumption that considering objects in privileged ground as potential referents is necessarily a sign of egocentric processing. Hawkins and Goodman (2016) and Hawkins et al. (2021) show that the utterances that lead to a recipient ignoring her interlocutor’s perspective in interpretation (utterances that in previous experiments were scripted by the experimenter and produced by a confederate) are in fact utterances that speakers do not naturally produce because they are uncooperative given the context. That is, if both participants know that some squares in the grid are occluded for their interlocutor (i.e., if these “known unknowns” are part of common ground), it is rational for the director to produce overinformative utterances in order to avoid potential ambiguity. If the receiver in turn expects the director to do this, it is rational for her to select the object that best fits the producer’s utterance, which will lead to her being “tricked” into considering an object that is occluded from the director’s view. To test this hypothesis, Hawkins and Goodman (2016) and Hawkins et al. (2021) ran a version of the task where both the director and the receiver are participants (rather than the director being a confederate) and the director is free to produce her own instructions (rather than following a script). They found that directors naturally produce utterances that more precisely pick out the target object relative to the distractor object in the receiver’s privileged view (which is hidden from the director) and that recipients show better performance in terms of perspective-taking. 
Hawkins et al. therefore conclude that the early findings of egocentricity reported by Keysar et al. (2000, 2003), and Lin et al. (2010) were (ironically) the result of sophisticated social reasoning about how a cooperative producer would be likely to choose her utterances given the context. In sum, questions remain about the extent to which perspective-taking in comprehension is automatic and effortless, but it is clear that receivers are able to employ it when the circumstances of the experiment are sufficiently natural. To get to a full understanding of the extent to which people engage in social reasoning in referential communication, future experiments should therefore carefully consider their ecological validity (see also Brennan & Hanna, 2009; de Ruiter & Albert, 2017; Rubio-Fernández, 2017).

Perspective-Taking Based on Linguistic Copresence

Interlocutors have also been shown to take linguistic copresence (i.e., things they have previously talked about) into account. An experimental paradigm that has shown this convincingly is a referential communication task where one participant (the director) has to give instructions to another participant (the matcher) for how to rearrange an array of novel objects (such as tangrams; see Figure 2; Brown-Schmidt, 2012; Hanna & Tanenhaus, 2004). In contrast to referential communication tasks where the stimuli used are familiar objects that can easily be named, this referential communication task uses novel stimuli, for which participants have to cocreate novel labels that can uniquely describe each stimulus. Clark and Wilkes-Gibbs (1986) showed that over multiple rounds of interaction between a pair of participants, these descriptions become shorter and more opaque. This shortening of descriptions indicates that the participants decrease their joint effort over time, while the increase in the opacity of descriptions indicates that the participants gradually build up common ground based on linguistic copresence. For example, for the tangram highlighted in red in Figure 2, a director might start with the description “bowing man with a triangle bum,” which might in a later round be shortened to simply “triangle bum” (adapted from Micklos et al., 2020). Importantly, these short and opaque descriptions that develop over a number of trials are dyad-specific. It has been shown that if participants are asked to describe the same stimuli to a new, unfamiliar partner, they switch back to longer and more elaborate descriptions, sometimes also involving a reconceptualization of the stimulus (Brennan & Clark, 1996; Horton & Gerrig, 2002; Horton & Spieler, 2007; Wilkes-Gibbs & Clark, 1992).

Figure 2. Set of tangram shapes.

Source: Used in and reproduced from Micklos et al. (2020).

A number of experiments have shown that shared linguistic experience is also taken into account during comprehension (Brennan & Hanna, 2009; Brown-Schmidt, 2009; Brown-Schmidt et al., 2008; Horton & Slaten, 2012; Metzing & Brennan, 2003). For example, Brown-Schmidt et al. (2008) developed an eye-tracking experiment where participants played a cooperative game in which some stimuli were in common ground while others were in privileged ground. As part of this game, producers asked questions like “What’s below the cow that’s wearing the hat?” To the receiver, the start of this utterance (“What’s below the cow”) was ambiguous because two different cows were part of the visual display (e.g., one wearing a hat, and another wearing shoes). In one condition, the task was constructed in such a way that participants had just discussed the animal below the cow wearing shoes, thus bringing that animal into common ground. Brown-Schmidt et al. contrasted this with a control condition in which participants had just discussed a different, unrelated animal. They found that in the first condition, receivers rapidly interpreted the producer’s question (during the ambiguous part of the utterance) to be about the cow with the hat. This indicates that receivers took into account that the animal below the cow wearing shoes must already be known to the producer, given that they had just discussed it (see further the concept of “givenness”: Krifka, 2008). In the control condition, in contrast, the eye-tracking results showed that receivers considered both cows to be the potential intended referent until the disambiguating word “hat” was produced.
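
The inference attributed to receivers here can be sketched as a simple filter over candidate referents. The Python below is a toy heuristic written for illustration only: it is not Brown-Schmidt et al.'s (2008) analysis, and the function name `candidate_referents`, the field names, and the animals ("pig", "horse", "duck") are all hypothetical. The assumption it encodes is that a question about what is below a cow is unlikely to concern a cow whose neighbor has already been discussed.

```python
def candidate_referents(cows, given):
    """Keep the cows whose neighbor below is not yet given in the discourse.

    Falls back to all cows if the filter would leave no candidate.
    """
    not_given = [cow for cow in cows if cow["animal_below"] not in given]
    return not_given or cows


# Hypothetical display: two cows, each with a different animal below it.
cows = [
    {"wearing": "hat", "animal_below": "pig"},
    {"wearing": "shoes", "animal_below": "horse"},
]

# Test condition: the animal below the cow with shoes was just discussed.
test_condition = candidate_referents(cows, given={"horse"})
# Control condition: an unrelated animal was just discussed.
control_condition = candidate_referents(cows, given={"duck"})

print([cow["wearing"] for cow in test_condition])     # ['hat']
print([cow["wearing"] for cow in control_condition])  # ['hat', 'shoes']
```

The sketch reproduces only the qualitative pattern described above: prior discourse narrows the candidate set before the disambiguating word arrives, while in the control condition both cows remain candidates.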

Perspective-Taking Based on Shared World Knowledge

Perspective-taking based purely on community comembership (i.e., world knowledge that is likely to be shared based on being part of the same community), in the absence of any contextual constraints, has been shown to be difficult. To assess whether people use shared world knowledge in production and comprehension, Sulik and Lupyan (2018) devised a novel signaling task in which participants were asked to use their native language (English) to convey a target word to a receiver using any other word, in such a way that it would allow the receiver to infer what the target word was. For example, to convey the target word “bank,” a producer might use the clue word “money” (as this is a salient association). However, given the clue word “money,” a receiver is unlikely to infer that the target word must have been “bank” because “money” has other, more salient associations (such as “cash”). As Sulik and Lupyan explain, there are two asymmetries that make this task particularly challenging. Firstly, as illustrated in the bank-money example, what is a salient association in one direction may not be a salient association in the other direction (as shown by word association studies; Nelson et al., 2004). Secondly, in communication, a producer and receiver have to work in opposite directions: the producer has to infer what is the best utterance to convey a given communicative intention, while the receiver has to infer what the producer’s most likely communicative intention is given her utterance. When these two types of asymmetries are combined, a difference in perspective is created: what is salient from the producer’s point of view may not be salient from the receiver’s point of view. This combination of the two forms of asymmetries occurs when interlocutors try to communicate about novel referents in the absence of physical or linguistic context.
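
The two asymmetries can be illustrated with a toy association table. The Python sketch below rests entirely on assumptions: the association strengths are invented for illustration and are not taken from Nelson et al. (2004) or from Sulik and Lupyan's (2018) data, and the function names are hypothetical. It contrasts a producer who picks the clue most salient to herself with one who simulates the receiver's inference before choosing.

```python
# Hypothetical cue -> associate strengths, invented for illustration;
# NOT taken from published word association norms.
ASSOCIATIONS = {
    "bank": {"money": 0.6, "river": 0.2, "teller": 0.2},
    "money": {"cash": 0.5, "rich": 0.3, "bank": 0.2},
    "river": {"water": 0.6, "bank": 0.3, "stream": 0.1},
    "teller": {"bank": 0.8, "story": 0.2},
}


def strongest_associate(cue):
    """Return the word most strongly associated with `cue`."""
    return max(ASSOCIATIONS[cue], key=ASSOCIATIONS[cue].get)


def egocentric_clue(target):
    """Producer picks the clue that is most salient to herself given the target."""
    return strongest_associate(target)


def allocentric_clue(target):
    """Producer picks the clue from which the receiver is most likely to recover the target."""
    candidates = ASSOCIATIONS[target]
    return max(candidates,
               key=lambda clue: ASSOCIATIONS.get(clue, {}).get(target, 0.0))


def receiver_guess(clue):
    """Receiver guesses the word most strongly associated with the clue."""
    return strongest_associate(clue)


print(egocentric_clue("bank"), "->", receiver_guess(egocentric_clue("bank")))
# money -> cash
print(allocentric_clue("bank"), "->", receiver_guess(allocentric_clue("bank")))
# teller -> bank
```

With these made-up numbers, the egocentric clue fails because the association from “bank” to “money” is strong while the reverse association is weak, whereas the clue chosen by simulating the receiver succeeds. Whether and when human producers actually behave in the second way is the empirical question this task addresses.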

To test whether interlocutors also take into account each other’s perspectives under these circumstances, Sulik and Lupyan (2018) devised three conditions of this novel signaling task: (a) an unconstrained version where producers were allowed to use any English word; (b) a version that contextually constrained the signal space, by having producers select a clue word from a predefined list; and (c) a version that contextually constrained the meaning space, by having receivers guess what the target word was from a predefined list. Sulik and Lupyan found that in the unconstrained condition, producers typically behaved egocentrically: sending the clue word that was most salient to themselves, given the target word. These clue words led to low success rates when receivers were asked to guess what the target word was. Similar results were found by Nedergaard et al. (2020) in a version of this experiment where participants took turns being the producer and receiver. These findings suggest that it is very hard for producers to take their receiver’s perspective into account when all they have to go on is shared world knowledge. In contrast, in Sulik and Lupyan’s condition that contextually constrained the signal space (but left the meaning space open), producers typically behaved allocentrically (i.e., taking the receiver’s perspective into account), and this accordingly improved receivers’ performance at guessing the target word. Furthermore, in the condition that contextually constrained the meaning space (but left the signal space open), the accuracy of the receivers’ guesses reached the highest performance across all conditions. Given that the only difference between the unconstrained condition and the constrained conditions is the availability of context, Sulik and Lupyan conclude that it is this context that made the improvement in perspective-taking possible, rather than the shared world knowledge itself.

The Sociocognitive Foundations of Language

Reaching mutual understanding requires that the receiver recovers the producer’s meaning, that is, the producer’s communicative intention, over and above the literal meaning of the utterance. But what does it mean for a producer to have a communicative intention? What psychological states are involved? In his seminal work on meaning, Grice (1957) argued that what sets communicative acts apart from noncommunicative acts is their intentional nature. A communicative act can only achieve its goal when it is recognized by the receiver as being a communicative act. Therefore, a communicative intention needs to be overt. Building on this work, Sperber and Wilson (1995) argued that human communication is ostensive-inferential; where the ostensive part refers to the producer’s role in making their communicative intention overt, and the inferential part refers to the receiver’s role in inferring the producer’s communicative intention (see also Levinson, 2000; Scott-Phillips, 2015, 2017). Sperber and Wilson argued that ostensive-inferential communication involves (a) an informative intention and (b) a communicative intention on the part of the producer. The informative intention captures what the producer wants to communicate, while the communicative intention captures that the producer wants to communicate. Note, however, that not every linguistic utterance is produced with the intention to inform; “Stop tickling me!” for example, is produced with the intention to induce a particular behavior. This led Moore (2017) to the following reformulation of Sperber and Wilson’s (1995) analysis of the intentions involved in ostensive-inferential (also known as “Gricean”) communication. A producer’s utterance is intentionally communicative if, and only if, they produce the utterance with the intention that:


(1) the receiver produce a particular response, and

(2) the receiver recognize that the producer intends (1).

According to Sperber (2000), holding as well as recognizing these intentions requires fourth-order metarepresentations of mental states. If Jaime wants to inform Andy that Bill got his first vaccination today, these representations are:

fourth order: Jaime intends
third order: that Andy should believe
second order: that Jaime intends
first order: that Andy should believe
that Bill got his first vaccination today.

The informative intention is encompassed by the first and second order, while the communicative intention is encompassed by the third and fourth order. Although it has been shown that human adults are in principle able to entertain such metarepresentations of others’ mental states up to as many as seven levels of embedding (O’Grady et al., 2015), one may question whether we use fourth-order metarepresentations in every single communicative interaction (see Moore, 2016; Scott-Phillips, 2016). Moreover, children’s ability to explicitly represent others’ mental states (known as mindreading or theory of mind) takes several years to develop, and goes hand-in-hand with their language development, rather than fully preceding it (Baldwin & Moses, 2001; Meristo et al., 2011; Peterson & Siegal, 2000; Pyers & de Villiers, 2013; Sabbagh & Baldwin, 2005; Slaughter & Peterson, 2011; Tomasello, 2000). This mindreading (or theory of mind) ability is what enables people to take each other’s perspectives in conversation, and therefore also underlies the ability to establish and maintain common ground (see Anderson, 2021; Bohn et al., 2019, 2021; Frank & Goodman, 2014; Goodman & Frank, 2016 for computational modeling work that captures the recursive social reasoning involved in pragmatic communication).
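The recursive social reasoning captured by the computational models cited above can be illustrated with a minimal sketch in the spirit of rational-speech-act-style models (Frank & Goodman, 2014; Goodman & Frank, 2016). It is not a reproduction of any cited model: the three-object reference game, the uniform prior over objects, the rationality parameter `alpha`, and all function names are assumptions made for illustration. A literal listener interprets an utterance by its truth conditions, a speaker chooses utterances in proportion to how well the literal listener would recover the intended referent, and a pragmatic listener reasons about that speaker.

```python
# Toy reference game (assumed for illustration): three objects and four
# one-word utterances, each true of the objects whose description contains it.
OBJECTS = ["blue square", "blue circle", "green square"]
UTTERANCES = ["blue", "green", "square", "circle"]


def is_true(utterance, obj):
    """Literal semantics: the word is true of an object that has that feature."""
    return utterance in obj.split()


def normalize(scores):
    """Rescale non-negative scores so that they sum to 1."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def literal_listener(utterance):
    """P(object | utterance) from truth conditions and a uniform prior."""
    return normalize({o: 1.0 if is_true(utterance, o) else 0.0 for o in OBJECTS})


def speaker(obj, alpha=1.0):
    """P(utterance | object): favors utterances that lead the literal listener to obj."""
    return normalize({u: literal_listener(u)[obj] ** alpha for u in UTTERANCES})


def pragmatic_listener(utterance, alpha=1.0):
    """P(object | utterance), reasoning about the speaker (uniform prior)."""
    return normalize({o: speaker(o, alpha)[utterance] for o in OBJECTS})


print(literal_listener("blue"))
print(pragmatic_listener("blue"))
```

With `alpha = 1`, the literal listener splits evenly between the two blue objects on hearing blue, whereas the pragmatic listener assigns about 0.6 to the blue square, reasoning that a speaker who meant the blue circle would more likely have said circle. Each layer of the recursion corresponds to one further level of reasoning about the other party's perspective.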

The apparent contradiction between this theoretical analysis arguing that complex representations of mental states are required for any ostensive-inferential communication on the one hand and the fact that mindreading abilities develop alongside language (rather than preceding it) on the other hand has led to new theoretical development in the analysis of what cognitive abilities are minimally required for doing ostensive-inferential communication (Moore, 2017). Moore shows that the analysis of what cognitive processes are involved in ostensive-inferential (or Gricean) communication assumes three cognitive abilities that are likely not present in infants. Firstly, it assumes a concept of belief, which is importantly different from the representation of other types of mental states (such as “knowing”) in that beliefs are about how another individual represents the world, in a way that may not correspond to reality (and therefore requires a form of representation that can be entertained completely independently from the individual’s own representation of the world; see e.g., Birch & Bloom, 2007). Secondly, it assumes the ability to make complex inferences about others’ goal-directed behavior. And finally, it assumes the ability to entertain fourth-order metarepresentations. Moore puts forward an analysis of a class of “minimally Gricean” acts, which do not run the gamut of human language interactions, but do satisfy a minimal definition of Gricean communication, and require none of these three cognitive abilities to be in place. This may then solve the issue of how infants are able to engage meaningfully in communicative interactions from a very young age and are able to acquire language simultaneously with developing their abilities to reason about others’ minds (see also Woensdregt et al., 2016 for computational modeling work on this topic).

While even adult communicators cannot directly observe the mental states of their interlocutors, and therefore cannot fully know their intentions, it can be observed how intentions are made visible in talk-in-interaction. Language is a social action used to make our intentions known to others and, ideally, to have others perform next actions (including appropriate communicative responses) that correspond with our intentions, or intended meanings.

Interactive Mechanisms

Despite its seemingly random and chaotic nature, human conversational interaction is highly organized. Interlocutors take conversational turns in a structured manner that allows for parties to continually exchange turns and generally only have one producer at a time, though minimal overlap does occur (Sacks et al., 1974). This back-and-forth exchange is made up of turns that anticipate another, subsequent turn. These coupled turns are called adjacency pairs; they make up many of the basic structures of interaction, from greetings to question-answer sequences (Schegloff, 1968), and it is suggested that they are a universal across languages (Kendrick et al., 2020). The first-pair part of an adjacency pair is a social action that presumes a next-action in the second-pair part, each of which is performed by different parties. For instance, an invitation to dinner is an action that anticipates the response of either declining or accepting the invitation. The documentation of this systematic organization of conversation is the purview of conversation analysis (CA), both a field and a methodology that details the sequential nature of human interaction through fine-grained analyses of phenomena in talk-in-interaction⁵ (for an overview of CA, see Sidnell, 2009). In addition to being highly frequent, these sequence-structured interactions are also rapidly produced in turn (Stivers et al., 2009), suggesting that receivers ascribe action to current, ongoing turns and formulate next-turns early and accordingly (Bögels & Levinson, 2017; Levinson, 2012). Turns-at-talk are not only orderly but also designed with the next-action and recipient in mind. That is, a producer will construct their turn in such a way that best conveys the intended action to a particular recipient, a phenomenon called turn design (see Drew, 2012 for an overview).
This can be achieved with a combination of linguistic and nonlinguistic resources, including words/signs, syntax, prosody, timing, eye gaze, and gesture (Drew, 2012). By recruiting these resources, producers are able to build an utterance that returns the anticipated next-turn action. In constructing a turn best designed to the next action and recipient, a producer attempts to achieve mutual understanding—of both the content and subsequent action—with the recipient.

Though conversational interaction exhibits structure in turn-taking and timing, communication is susceptible to problems of hearing/seeing and/or understanding that can arise from noise, ambiguity, and a number of other factors. To address these miscommunications, participants can engage in “self-righting mechanisms” (Schegloff et al., 1977) adapted to such problems, namely repair. Conversation analysts have investigated repair as a (potentially) multiturn action in which interlocutors actively participate to resolve problems in communication (Dingemanse & Enfield, 2015; Drew, 1997; Jefferson, 1987; Manrique & Enfield, 2015; Moerman, 1977; Schegloff, 1992, 1997, 2000; Schegloff et al., 1977). Self-initiated repair, for example, is the producer’s attempt to adjust their utterance to avoid potential miscommunication. In effect, self-initiated repair is proof of concept of turn design (Drew, 2012). A producer may hesitate, restart, correct, or adjust information throughout the turn to prevent problems for the recipient. Recipients can acknowledge their grasp of the producer’s turn through a variety of feedback forms, which may include carrying out the intended next-action, or simply confirming understanding via backchannels. Backchannels are recipients’ minimal responses—often produced in overlap with another’s concurrent utterance—that provide feedback to the producer. Many backchannels are confirming tokens, such as “uh-huh,” “yeah,” “mmhm,” and head nods. Backchanneling can signal continuation as well as understanding of the current turn (Schegloff, 1982)—both of which are important cues to a producer. However, recipients may also be faced with situations in which backchannels are not appropriate, and instead they need to perform another action to ensure understanding.

Other-initiated repair (OIR) is a side sequence (Jefferson, 1972) that allows for deviation from the main action in order to first resolve trouble sources. While a preference for self-correction—which includes self-initiation and self-repair—exists with respect to turn organization, and even face-threat, “others” (i.e., not the current producer) do initiate repair sequences through various strategies in turns following the trouble source (Schegloff et al., 1977; see also Kitzinger, 2012 for an overview). For instance, following the news-telling “Shirley moved to Milwaukee,” repair may be initiated to pinpoint the trouble source. These strategies include open questions such as “huh?” or “what?”, specific question words like “who?” or “where?”, partial repeats of the problematic utterance with question intonation (“Shirley?”), or candidate understandings (“You mean Sharon’s sister Shirley?”; Schegloff et al., 1977). These initiation strategies engage the trouble source producer with the opportunity to repair the miscommunication, and a particular strategy may be selected by the repair initiator to help the producer identify the source of the misunderstanding. As may be gleaned from the examples of other-initiations given here, referential expressions appear to be a common source of trouble in conversation (Bazzanella & Damiano, 1999). In fact, addressing trouble sources in general is not an uncommon occurrence in conversation, as OIRs occur on average every 84 seconds (Dingemanse et al., 2015). Repair, then, is a vital interactional resource that halts the ongoing action in order to (re)establish mutual understanding.

Schegloff (1992) asserts “[t]he ordinary sequential organization of conversation thus provides for displays of mutual understanding and problems therein–one running basis for the cultivation and grounding of intersubjectivity” (p. 1301). Backchannels and OIR are two interactional mechanisms in which communication is streamlined in an effort to achieve and display mutual understanding. Backchannels are generally used to (passively) acknowledge mutual understanding, while OIRs signal breakdowns in mutual understanding and require highly interactive (i.e., jointly constructed) collaborations.


When conversation goes smoothly, mutual understanding is frequently acknowledged by nonactive participants (i.e., recipients) by means of backchannels, or minimal responses such as mhmm, yeah, uh-huh, and head nods. Backchannels cover a range of response forms and, similarly, a range of functions. For instance, “yeah” or “oh no” can act as assessments⁶ on the prior information, while “uh-huh” functions as a continuer wherein the recipient acknowledges the current ongoing turn and signals its continuation (Bangerter & Clark, 2003; Schegloff, 1982). Similar to continuers, backchannels that function as tokens of understanding display passive recipiency. They are not meant as turns that seek the floor, despite potentially occurring at the end of a turn unit, but they do provide the producer with a cue to the recipient’s attention and understanding. By passing the opportunity to take a turn when providing a minimal response, then, recipients are acknowledging their understanding of the prior turn instead of taking a turn to initiate repair, an action that is always relevant (Schegloff, 1982). As a form of collateral communication (Clark, 1996, p. 241), backchannels provide metacommentary on the conversation itself rather than its content.

Backchannels can take on various forms and are ubiquitous across linguistic modalities. Common verbal backchannels include mhmm, yeah, uh-huh, and their equivalents across languages. Addressees’ mimicry of producers’ hand gestures in a tangram experiment also demonstrates mutual understanding by way of collateral communication (Holler & Wilkin, 2011). Other backchannel forms demonstrate the multimodal nature of interaction. Facial gestures can be used in isolation or jointly with speech/signs to do the work of backchanneling. Brow raises (Chovil, 1991), smiles (Brunner, 1979), and blinks (Hömke et al., 2017) have all been shown to function as backchannels. Long blinks, for example, are often jointly produced with head nods or vocalizations to signal continued recipiency even in light of a current speaker’s self-repair (Hömke et al., 2017). Regardless of form, backchannels keep conversations progressing in a state of mutual understanding.

While backchannels may seem inconsequential—they are not always included in transcripts, subtitles, or in retellings of conversations, for example—producers rely on them when monitoring for understanding (Bavelas et al., 2000) and are quite sensitive to their absence (Krauss & Weinheimer, 1966). Evidence for a producer’s seeking of mutual understanding through these acknowledgment tokens can be seen in examples of try-marking. Speakers might, for instance, use rising intonation over a potentially problematic referent (Schegloff, 1982, p. 79). This rising intonation attempts to make the referential expression operate as a question: do you understand this referent? In designing a turn as such, the speaker provides the addressee with an opportunity to indicate their mutual understanding so that the conversation can progress.

However, as Schegloff (1982) observes, backchannels are participants’ claims to mutual understanding and do not necessarily mean mutual understanding has been achieved. Therefore, when recipients recognize their lack of understanding, a set of interactional resources becomes available, namely that of OIR.

Observational Accounts of OIR

Coordinating actions among individuals through language is a feat we often take for granted. Humans are able to carry out this amazing ability by negotiating turn allocation, making use of nonlinguistic information, and aligning to interlocutors’ meanings. Though we often do this without thought, sometimes things go wrong. Fortunately, conversation affords a systematic, social mechanism to resolve problems in communication, whether they arise from issues of hearing or seeing, or from misunderstandings; this is the process of repair. In fact, repair is always a relevant next action (Schegloff, 1982) and such sequences take precedence over other actions in conversation (Kendrick, 2015; Sacks et al., 1974), such as a response to a question; the need to resolve the trouble is a prerequisite to continuing the initial action. And often, in order to progress the conversation, what needs to be repaired is mutual understanding.

In the face of unrecognizable referents or inadequate next actions, for instance, OIR sets up a sequence of actions through which repairables are modified and retested for mutual understanding. OIR is a collaborative process aimed at (re)establishing intersubjectivity—or, mutual understanding—when “problematic understandings” arise (Schegloff, 1992). Who is “at fault” in these instances can be consequential for the (re)constitution of interactional relationships (Robinson, 2006), and may lie with either one or both interlocutors. While this “trouble responsibility” (p. 139) may be negotiated in the interaction, OIR is nevertheless initiated by the receiver through various strategies.

Initiation strategies of this type vary based on the degree of understanding achieved by the recipient. Three categories of repair initiation strategies⁷ are proposed by Dingemanse and Enfield (2015): open requests, restricted requests, and restricted offers (or candidate understandings). Open requests (e.g., “Huh?”) display little grasp of the prior turn and indicate a problem with hearing or understanding without pinpointing a specific trouble source, as seen with “what” used in line 2 of Extract 1. Restricted requests pinpoint a specific trouble source as requiring repair (e.g., “To whom?”) and indicate a problem of understanding (usually by requesting clarification) or partial hearing. In Extract 2, for instance, “who?” in line 2 requests a repair to the ambiguous “they” in line 1. Finally, restricted offers are candidate understandings of the prior talk (e.g., “You mean X?”), which indicate a problem of understanding and seek confirmation. In Extract 3 and Figure 3, the recipient in line 2 adds “of cars?” to modify the phrase “polishing glass” as a means to confirm his understanding that A’s job is polishing car windows rather than, for instance, building windows. This candidate understanding is met with confirmation in line 3. Interlocutors tend to adhere to a principle of specificity in initiating repair, which is balanced by principles of conservation and division of labor (Dingemanse et al., 2015). That is, while it is cost effective to simply initiate with an open class “Huh?”, it requires more work for the original producer to then repair the unidentified trouble source. Thus, being more specific (when possible) in indicating trouble divides the repair work equitably among the interlocutors. In launching these initiation strategies and adhering to their principles of use, recipients indicate a breakdown in communication and in particular in mutual understanding when repairing referential expressions.
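The relation between a recipient's degree of understanding and the three initiation formats can be summarized in a deliberately simplified sketch. The two Boolean inputs and the function name `choose_initiation` are hypothetical conveniences, not part of Dingemanse and Enfield's (2015) typology. The sketch assumes only that, following the principle of specificity, a recipient uses the most specific format that their grasp of the prior turn permits.

```python
from enum import Enum


class Strategy(Enum):
    OPEN_REQUEST = "open request"              # e.g., "Huh?"
    RESTRICTED_REQUEST = "restricted request"  # e.g., "Who?"
    RESTRICTED_OFFER = "restricted offer"      # e.g., "You mean X?"


def choose_initiation(located_trouble: bool, has_candidate: bool) -> Strategy:
    """Return the most specific format the recipient's understanding allows.

    located_trouble: the recipient can pinpoint which part was problematic.
    has_candidate: the recipient can also propose a possible understanding.
    """
    if located_trouble and has_candidate:
        return Strategy.RESTRICTED_OFFER
    if located_trouble:
        return Strategy.RESTRICTED_REQUEST
    # Without a located trouble source, only an open request is available.
    return Strategy.OPEN_REQUEST


print(choose_initiation(located_trouble=False, has_candidate=False).value)
print(choose_initiation(located_trouble=True, has_candidate=False).value)
print(choose_initiation(located_trouble=True, has_candidate=True).value)
```

Real repair initiation is of course graded and multimodal rather than a two-feature decision. The sketch serves only to show how greater specificity on the recipient's side leaves less repair work for the original producer.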

Extract 1: Siwu (Neighbors_4875900 reproduced from Dingemanse, 2015)

Extract 2: Cha’palaa (CHSF2011_06_25S2_1350464 reproduced from Floyd, 2015)

Extract 3: Argentine Sign Language (ASAM_244140 reproduced from Manrique, 2016)

Figure 3. B offers a candidate understanding of A’s turn.

Source: Reproduced from Manrique (2016).

Phenomenological accounts based in CA describe OIR across a range of languages and suggest that certain features of repair are pragmatic universals (Dingemanse & Enfield, 2015; Dingemanse et al., 2015). Given the importance of repair in resolving problems in communication, it was supposed that such a mechanism should exist, in some form, cross-linguistically (Schegloff, 1987). Dingemanse and colleagues’ (2015) comparative metastudy indeed demonstrates that OIR is used across languages as disparate as Cha’palaa in Ecuador (Floyd, 2015); Lao (Enfield, 2015), Yélî Dnye (Levinson, 2015), and Malay (Mohd Jan & Saad, 2018) in Asia; Icelandic (Gisladottir, 2015) and Italian (Rossi, 2015) in Europe; Siwu (Dingemanse, 2015) in Africa; and Murrinh-Patha (Blythe, 2015) in Northern Australia. Moreover, OIR follows similar patterns of use in nonverbal modalities, including Argentine Sign Language (Manrique, 2016), American Sign Language (Dively, 1998; Most, 2003), Norwegian Sign Language (Skedsmo, 2020), Tactile Australian Sign Language (Willoughby et al., 2014), Swiss-German Sign Language (Girard-Groeber, 2020), and Chinantec whistled speech (Sicoli, 2016). The three basic strategies for initiating repair, as well as the principles for specificity and least collaborative effort, are found across various language families and isolates (Dingemanse et al., 2015) and across communicative modalities, thus demonstrating the universal nature of OIR. That is, while the phonetic and/or lexical realization of OIR may be language-specific, sharing these pragmatic universals means OIR is organized and functions similarly cross-linguistically.

Highlighting the multimodal nature of interaction, interlocutors can recruit numerous resources to initiate or supplement repair initiation. The incorporation of the face and body as semiotic resources in repair highlights the multimodal display of misunderstanding wherein these resources come to mutually elaborate one another (Goodwin, 2012). For instance, the face and body can be recruited to mark misunderstanding as with the eyebrow raise or furrow (Kendrick, 2015), the head poke or lateral tilt (Seo & Koshik, 2010), gaze patterns in Yélî Dnye (Levinson, 2015), and the freeze-gaze in Argentine Sign Language (Manrique, 2016). These resources are sufficiently embedded in doing repair such that even Tactile Auslan users (Deaf users of Australian Sign Language, or Auslan, who have degenerative vision loss) continue to incorporate facial expressions and shrugs that would not be perceptible to their interlocutors but had been part of their Auslan interactions prior to vision loss (Willoughby et al., 2014). Recruiting embodied resources beyond the word or sign highlights a divergence in communication: something has gone wrong and it needs to be addressed.

Once repair has been initiated, it can then be resolved. Like repair initiations, repair solutions also may be taken up by either the original producer (self) or the recipient (other). With other-initiated self-repairs (OISR) that arise from problems of hearing/seeing, a producer might simply repeat the original phrase. With problems of understanding, however, the trouble source might be clarified or elaborated upon or replaced entirely with a new word or phrase (Schegloff et al., 1977). The degree to which the original utterance is modified in OISR depends on the type of initiation strategy used by the “other,” such that initiation strategies are designed for certain solutions (Dingemanse & Enfield, 2015). Trouble source repairers, then, design solutions to match what the recipient has located as problematic (recall recipient design in “Perspective-Taking”) and by doing so put forth a best effort attempt to regain mutual understanding. Once the repair is performed, and no further trouble sources emerge, the repair sequence is closed by resuming where the producer left off (e.g., responding to an invitation). Acknowledging the repair may also precede the continuation of the prior action; acknowledgment tokens such as “oh” and “right” will then come before returning to the original action (Heritage, 1984; Jefferson, 1972; Koivisto, 2015). These tokens, similar to backchannels, display a (potential) return to mutual understanding.

In itself, repair is not a guarantee of mutual understanding. It is an interactive process meant to align interlocutors with the goal of achieving mutual understanding in order to continue with ongoing social actions. The ultimate accomplishment is then the progression of the interaction, which may continue with participants having a “good enough” understanding of referents (Albert & de Ruiter, 2018; Robinson, 2014). Therefore, it is possible that intersubjectivity cannot be reached, even through multiple repair attempts, and that interlocutors may be forced to move on from problematic talk or from the interaction as a whole (for an extended example of this regarding reference, see Schegloff, 1992, pp. 1334–1337). In cooperative communication, though, we might suspect that interlocutors will repeat attempts to reestablish mutual understanding, at least within reason. Experimental studies involving OIR might lend insight into how interlocutors recruit this interactive mechanism in order to carry out cooperative communication tasks.

Experimental Accounts of OIR

Not only has repair been well documented in the service of resolving miscommunication in natural interactions, it has also been experimentally investigated. In various fields, including experimental semiotics, experimental pragmatics, language and cultural evolution, and dialogue studies, the question of when repair occurs, and what effect it has on the experimental task and its outcomes, has been explored.

The role of feedback—backchannel and repair mechanisms—in the dialogue surrounding tasks that are jointly carried out by participants is another line of inquiry in experimental semiotics. In a metastudy of numerous referential communication tasks (e.g., Lego building, maze, and tangram tasks), Bangerter and Clark (2003) identify backchannels such as mmhm, uh-huh, and yeah as horizontal project markers that allow for the continuation of ongoing actions in the task. Backchannels signal a “go-ahead” for directors to continue the current or next action, and pass on the opportunity to take the floor. Repairs, on the other hand, act as vertical project markers, which set apart subprojects in task-oriented dialogue. This is especially the case with clarification requests that set up a side sequence wherein information is requested, clarified, and acknowledged before returning to the task at hand (e.g., building a Lego structure, navigating a maze, or identifying a shape). Feedback not only influences the task dialogue but is also influenced by the context and goals of interaction. For instance, Fusaroli et al. (2017) find increased use of repair in task-oriented dialogue compared with free, spontaneous dialogue. Moreover, the frequency of repair types differs depending on the type of interaction. Task-oriented dialogue relies more on explicit negotiation and confirmations (i.e., the restricted repair formats in Dingemanse & Enfield, 2015) while spontaneous dialogue tends to favor more generic forms of repair (i.e., open request formats in Dingemanse & Enfield, 2015) and backchannels (Dideriksen et al., 2019, 2020; Fusaroli et al., 2017). These findings demonstrate how repair may be more relevant in certain contexts, namely in task-oriented talk and when doing reference. These contexts involve talk that becomes actionable by recipients, as producers might be expected to be more accountable for the veracity of their utterances.

In language evolution studies, the faithfulness of signal transmission is paramount given the context of establishing new conventions for communication; that is, linguistic alignment might be more consequential in cases of developing shared conventions, like labels. Some of these studies consider the role of interactive mechanisms in attaining mutual understanding through the conventionalization and systematization of emergent form-meaning pairs through iterated referential communication paradigms. In these experiments, participants are either given an artificial language to use in referring to objects, or they are asked to spontaneously create signals to refer to those objects. In some paradigms the referents are novel while in others the signal or modality is novel. The iterated nature of these paradigms involves the developing signal-meaning system to be “passed down” to another set of users, who then use and modify the system to be passed down to yet another set of users (called “generational turnover”). Though not many investigations of language evolution incorporate repair as an explicit factor in the conventionalization process, it has been found to varying degrees, and especially in early stages of iterated referential communication games, even when participants were not instructed on the ability to use it (Fay et al., 2010, 2018; Macuch Silva et al., 2020). These studies underscore the ubiquity of repair but they do not reveal that individual turns with repair aid the identification of the correct referent, or communicative accuracy (Fay et al., 2018). This finding, though, reflects the nature of repair in natural conversations; repair is a mechanism to facilitate mutual understanding, not a guarantor of it. In fact, the presence of repair is itself a signal of troublesome referencing (Micklos et al., 2020), and early repair in particular might facilitate accuracy in later turns (Micklos et al., 2018). 
However, the communicative efficiency of signals is supported by repair in iterated communication games. Just as previous studies using established languages have shown, referential expressions that could be modified through feedback processes are shorter, for example, gestures in Fay et al. (2013), or less complex, for example, drawings in Fay et al. (2018) as seen in Figure 4. What is most notable in these experiments is that repair is used as an interactive mechanism to attain mutual understanding even when not explicitly given as a tool.

Figure 4. Over repeated plays of a communication game, the drawn representations for “Parliament” become more simplified in conditions that allow for alignment (i.e., shared sign system allowed) and especially feedback (e.g., clarification requests).

Source: Reproduced from Fay et al. (2018).

In language evolution experiments more explicitly testing the effects of repair (or, “feedback”), it has been found that repair does help communicative efficiency. That is, the ability to repair can lead to increased alignment to less complex and more systematic communicative conventions. In graphical communication tasks, the opportunity to mutually modify a partner’s output led to more abstraction in the drawings (Healey et al., 2007) and to less drawing space (Fay et al., 2010, 2018). Healey et al. (2007) suggest the finding of “repair-driven co-ordination process” in communication could likely be observed in emerging gestural systems (and is widely demonstrated in vocal systems; Jefferson, 1987). In fact, in cross-signing and silent gesture tasks, pairs that engaged in repair saw greater systematicity in referential expressions (Byun, Roberts, et al., 2018; Micklos, 2016, respectively). And, though Macuch Silva et al. (2020) did not find an increase in communicative efficiency while using repair in terms of turn length, repair in these turns did lead to subsequently shorter turns. That is, the shorter the turn length, the more efficient the communication; undoubtedly turns involving repair would be longer than without it because repair inherently requires more turns. Despite focus on communicative accuracy and efficiency in these studies, some also reflect the findings from natural data with respect to strategy use and specificity. All three repair strategies are found in these experimental setups (Byun, Roberts, et al., 2018; Micklos, 2016), and there even tends to be a preference for using more restricted (i.e., specific) strategies, as in natural conversations.

It is important to note here that the affordances of a given experimental design (including communicative modality, stimulus modality, degree of interactivity, meaning space, etc.) will influence the degree to which (spontaneous) repair emerges. For instance, novel auditory stimuli are particularly difficult to repair with a candidate understanding, possibly due to the need to provide precise vocal reproductions for candidates (Macuch Silva et al., 2020). However, visual modalities appear more likely to lend themselves to repair sequences as they typically allow for: (a) visual access to the face of a coparticipant to monitor understanding, and (b) the opportunity to act upon a meaning space by, for instance, pointing to a gestured meaning. The phenomenological descriptions of repair similarly indicate the benefits of visual access. While access to facial cues seems to increase the likelihood of engaging in repair, modalities that produce visual artifacts can do so as well. For example, in Fay et al. (2010) abstract meanings were communicated in a Pictionary task wherein the graphical signal became an artifact upon which graphical repairs could be performed by drawing question marks, checks, or other indicators of misunderstanding. While only a limited set of repair strategies may be possible in a given modality, repair nonetheless remains a valuable resource for referential communication and mutual understanding in contexts of emergent conventions.

The Complementary Nature of Cognitive and Interactive Mechanisms

As mentioned in the “Introduction,” separating out the cognitive and interactive mechanisms that help us achieve mutual understanding creates an artificial divide: in reality, these two types of processes are intimately intertwined. This section discusses the ways in which they interact as exemplified by several experimental paradigms.

Dialogue, or conversation, is a jointly constructed activity with the goal of achieving and maintaining mutual understanding, as enabled by building common ground (Clark, 1996, Chapter 4; and others, see “Common Ground”), which can be attained through repair in instances of breakdown. Pickering and Garrod (2004) call this achievement alignment, wherein mental representations are shared between interlocutors. They argue that repair is a “primitive mechanism” (p. 172) that allows interlocutors to fix misaligned representations, and they point to clarification requests as a particular strategy that relies on implicit common ground. That is, clarification requests check one’s own representations against those of one’s interlocutor, as shown in the dialogue of participants in a maze task (Garrod & Anderson, 1987). If misalignment is not resolved, repair is iterative in that it can be repeatedly performed, with or without modification, until alignment is reestablished. Repair can also use full common ground, in which reliance on inference and explicit negotiation of representations may be involved (Pickering & Garrod, 2004). Additionally, Clark and Krych (2004) demonstrate that when directors and builders had visual access to one another’s workspaces in a Lego task, they were able to operate on actions made in the visible workspace; namely, they could give feedback, which in turn made task completion quicker and less error prone. Since participants in this condition had a “larger” common ground (by means of the director having visual access to the builders’ workspace, simulating a form of physical copresence), they could make use of interactive feedback more effectively. In a study that restricted the ability to use common ground, Fay et al. (2018) found that communication success diminished without the ability to align symbols via imitation, and that referential communication could be further complicated when feedback processes were also eliminated (see Figure 4). While alignment is often considered automatic in cognitive models, Fusaroli et al. (2017) argue that repair and alignment are intertwined, such that alignment can involve the strategic use of repair mechanisms like repetition rather than relying only on structural priming (cf. Pickering & Garrod, 2004).

Much of the discussion on achieving mutual understanding here has been couched in terms of referential expression, and empirical studies similarly investigate how repair affects referential communication and expression itself. “Perspective-Taking Based on Linguistic Copresence” shows that people take into account their linguistic common ground with a given interlocutor in the process of converging on descriptions of novel stimuli (tangrams). Pairs of participants develop increasingly short and opaque descriptions over a number of trials, and when asked to interact with a new participant, directors return to using longer descriptions (Brennan & Clark, 1996; Horton & Gerrig, 2002; Horton & Spieler, 2007; Wilkes-Gibbs & Clark, 1992). But how do pairs of participants arrive at these short and opaque descriptions? Early studies show that feedback, broadly defined as backchannels, acknowledgment tokens, and repair, led to shorter referring expressions for abstract figures (Krauss & Weinheimer, 1966). In follow-up studies that employed some modifications to the Krauss and Weinheimer method, namely the use of hard-to-describe tangrams (see Figure 2), requests for expansion, whether open or restricted, facilitated a decrease in the number of words used to describe, or refer to, a tangram, as well as in the number of turns used to describe it (Clark & Wilkes-Gibbs, 1986). Clark and Bangerter (2004) later more specifically identified “other-repaired noun phrases” as the explicit request for help in labeling tangrams. An example of this process can be seen in Figure 5, taken from a recent replication of the tangram experiment conducted in text chat (Micklos et al., 2020). In “Perspective-Taking Based on Linguistic Copresence,” this example demonstrated that linguistic copresence and the building of a shared linguistic repertoire (i.e., emerging community comembership) allowed pairs to reduce their descriptions. It can also be observed that repaired phrases (indicated by arrows) facilitate shorter descriptions for objects for which participants did not have conventionalized reference forms.

Figure 5. Repairs, as indicated by the arrows, on referential descriptions of tangrams led to shorter referential expressions over iterated communication such that “bowing man with a triangle bum” could become “triangle butt” or even “TB” over time.

Source: Reproduced from Micklos et al. (2020).

Similarly, in a maze task, novel location references had to be negotiated by participants; by using clarification requests composed of repetitions with rising intonation, participants were able to “systematize and refine their referring expressions” (Mills, 2014, p. 164). While feedback in general affects referential expressions themselves, Healey et al. (2018) suggest that repair, more so than backchannels, strongly influences the coordination of novel and emergent referential schemes in a maze task. It may be the case, then, that repair plays an important role in establishing referential expressions when interlocutors are faced with novel contexts, in which the ability to rely on certain features of common ground is reduced. Interactive repair may also form a solution or buffer when the resource demands of (recursive) social reasoning and inferencing simply become too high: distributing the process of reaching mutual understanding over multiple turns and minds may alleviate the burden on cognitive processes internal to the individual (Dingemanse, 2020). In line with this idea, van Arkel et al. (2020) showed, using a combination of agent-based modeling and computational complexity analysis, that agents using a simple repair mechanism can “offload” the computational demands of pragmatic reasoning onto interaction, achieving similar communicative success at lower computational cost.

Reference can also prove demanding in situations where interlocutors do not share a common language. Byun, Vos, et al. (2018) demonstrate that other-initiated repair (OIR) and try-marking occur in cross-signing encounters where individuals from different signing communities are brought together for a referential communication task. OIRs were most frequently performed as repetitions, further corroborating Mills’ (2014) findings with novel referents. Moreover, the majority of OIRs were initiated with try-marking that was indicated by sustained eye gaze, holding of the sign, mouthing, and repetition (Byun, Vos, et al., 2018). These resources are similarly found in both multimodal and signed repair sequences in nonexperimental contexts. Try-marking, as described in the “Introduction,” exemplifies the complementary nature of cognitive and interactive mechanisms in attempts to maintain mutual understanding: it makes use of recipient design while simultaneously opening an opportunity for participant acknowledgment or repair. In addition, the cross-signing study by Byun, Vos, et al. (2018) reveals how repairers of trouble source signs rely on common ground (e.g., shared and/or community knowledge) when resolving repairs. In Figure 6, signer B (from Uzbekistan) initiates repair on the problematic sign for subway through repetition. Signer A (from South Korea) offers a modification relying on assumed shared knowledge (that buses and subways are similar modes of transportation, except for their location above or underground) and potential community knowledge of the sign for bus. Signer B acknowledges understanding with a simultaneous display of head nodding and repeating the trouble source sign for subway.

Figure 6. Cross-signing participants engage in repair and use of common ground to establish mutual understanding.

Source: Reproduced from Byun, Vos, et al. (2018).

Taken together, these investigations show how cognitive capacities and interactive resources are both present and mutually elaborated in order to achieve mutual understanding, whether in the context of already conventionalized language use or when conventions are either nonexistent or newly forming. The cognitive and interactive mechanisms that are used to achieve mutual understanding also presuppose one another in several ways. The cognitive mechanisms of perspective-taking in language use rely on interaction to constantly shape and sharpen interlocutors’ representations of common ground. Conversely, interactive mechanisms also presuppose cognitive mechanisms. For example, OIR is made possible by the receiver being able to infer that the producer had a communicative intention that was not met, and having the expectation that initiating repair will lead the producer to provide a more “recipient-designed” version of the utterance. Furthermore, Dingemanse et al. (2015) showed that the use of OIR (across a diverse sample of languages) follows several core principles that together demonstrate the cooperative nature of language use: by choosing the most specific repair initiator possible, people ensure that their collaborative effort is minimized, rather than their egocentric effort.

Summary and Future Directions

While cognitive capacities and interactive resources in human communication have been treated separately in this article, it is clear that they work in concert with one another. Cognitive and interactive strategies for achieving mutual understanding in conversation are complementary processes that take place within the individual and between interlocutors. As with the initial example of try-marking, all mechanisms discussed here contribute to a larger process of working toward mutual understanding. Common ground can be inferred in the production of backchannels, where an addressee acknowledges understanding of another’s, and likely their shared, knowledge. Additionally, perspective-taking can be observed not only in instances of try-marking but also in other-initiated repair when repair initiators select a particular strategy to most accurately pinpoint trouble sources. Thus, repair initiation strategies employ recipient design to avoid underinformativeness and to facilitate more precise resolution. While evidence for cognitive and interactive mechanisms for mutual understanding comes from disparate fields, the unified perspective presented in this article should encourage the continued pursuit of understanding the complementary roles of these mechanisms in interaction.

From a conversation analysis (CA) perspective, repair does not guarantee a return to mutual understanding, but rather a return to progressivity—the action at hand—and therefore the mutual understanding achieved might be just “good enough” to do so. However, from a cognitive psychology perspective, repair itself is a signal of misunderstanding and it can be observed how producers of such a signal consider common ground and perspective-taking as they work to resolve miscommunication and reestablish mutual understanding. A “methodological fusion” between experimental psychology and CA has been proposed by de Ruiter and Albert (2017; see also Albert & de Ruiter, 2018; Schegloff, 1991), who argue that each method could benefit from the incorporation of the other. For instance, methods grounded in experimental psychology might benefit from the operationalized concepts and interaction analysis of CA, while studies involving CA could be enhanced by quantification and systematic testing (de Ruiter & Albert, 2017).

Levinson (2006) proposes a theoretical model called the human interaction engine that relies on the complex interplay of both cognitive and interactive “ingredients” (p. 54) as the crux of human evolution, socially and linguistically. On the cognitive side, the interaction engine is built from capacities to establish common ground, simulate others’ minds, and attribute intentions. Intertwined with these capabilities are the interactive features of the interaction engine, which include cooperation and observable practices, like that of repair. As seen throughout this article, this human interaction engine not only makes possible conversation using an existing language but is also relied upon (arguably all the more heavily) when linguistic conventions are not yet in place and have to be built over repeated interactions and/or generations in a cultural evolution setting. However, some of the cognitive and interactive mechanisms discussed in this article themselves make use of conventionalized signals. Computational modeling work shows that building a shared conventional language can improve agents’ ability to take each other’s perspectives, suggesting that language may have culturally coevolved with the sophisticated mindreading skills of modern humans (Woensdregt & Smith, 2017; Woensdregt et al., 2020). Furthermore, computational modeling work suggests that the systematic structure seen across human languages may interact in a positive feedback loop with an efficient repair system such as that seen across languages, as discussed in “Observational Accounts of OIR” (Woensdregt & Dingemanse, 2020). To conclude, the cognitive and interactive mechanisms that enable us to achieve mutual understanding in communication are not just intertwined with each other but may also interact in intricate ways with language as a conventional system itself.

Further Reading



  • 1. This article will use a few terms interchangeably, but these reflect the typical use in the different disciplines outlined here. Producer and sender may be interchangeable, as are receiver and recipient. These terms have also been selected to reduce modality bias.

  • 2. This article will focus on referential communication as a case study for what makes everyday language interactions complex. Note, however, that empirical work has also been done on mechanisms for mutual understanding in other forms of communication, such as procedural coordination (Mills, 2011, 2014).

  • 3. When contrasting a hypothetical producer and receiver, this article will systematically refer to the producer as “she” and the receiver as “he,” for disambiguation purposes.

  • 4. Note that more recent work has added increasing nuance to the idea that overspecified utterances are always an indication of failure to take into account the receiver’s perspective. For example, Jara-Ettinger and Rubio-Fernandez (2020) developed the incremental communicative efficiency model, which argues that overspecification can be a sign of taking into account the receiver’s perspective when it aids the receiver’s visual search in real time (see also Degen et al., 2020; Rubio-Fernandez, 2019).

  • 5. Talk-in-interaction, and talk more generally, is not exclusive to spoken language, but rather is recognized in the fields of ethnomethodology and conversation analysis to mean all forms of communication in interaction, including linguistic signals regardless of modality, and other resources such as posture, gestures, and facial expressions (Schegloff, 2006). In this article, the use of “talk” and “talk-in-interaction” should similarly be taken as modality-inclusive.

  • 6. Backchannels also have an affective function in which they show the interlocutor’s stance; for example, alignment to the current speaker and/or their utterance (Gardner, 2001). Facial expression and gestures similarly can perform this function, as with brow raises to show surprise (Ekman, 1979) and nods to display affiliation (Stivers, 2008).

  • 7. For more details on OIR strategies, see Dingemanse and Enfield (2015) and Floyd et al. (2016).