Printed from Oxford Research Encyclopedias, Business and Management. Under the terms of the licence agreement, an individual user may print out a single article for personal use (for details see Privacy Policy and Legal Notice).

date: 25 February 2024

Experience Sampling Methodology

  • Joel Koopman, Mays Business School, Texas A&M University
  • Nikolaos Dimotakis, Spears School of Business, Oklahoma State University


Experience sampling is a method aimed primarily at examining within-individual covariation of transient phenomena utilizing repeated measures. It can be applied to test nuanced predictions of extant theories and can provide insights that are otherwise difficult to obtain. It does so by examining the phenomena of interest close to where they occur and thus avoiding issues with recall and similar concerns. Alternatively, the experience sampling method (ESM) can be used to collect highly reliable data for investigating between-individual phenomena.

A number of decisions need to be made when designing an ESM study. Study duration and intensity (that is, total days of measurement and total assessments per day) represent a tradeoff between data richness and participant fatigue that needs to be carefully weighed. Other scheduling options need to be considered, such as triggered versus scheduled surveys. Researchers also need to be aware of the generally high potential cost of this approach, as well as the monetary and nonmonetary resources required.

The intensity of this method also requires special consideration of the sample and the context. Proper screening is invaluable; ensuring that participants and their context are applicable and appropriate to the design is an important first step. The next step is ensuring that the surveys are planned in a way that is compatible with the sample, and that they are designed to appropriately and rigorously collect data that can be used to accomplish the aims of the study at hand.

Furthermore, ESM data typically require careful consideration of how the data will be analyzed and how results will be interpreted. Proper attention to analytic approaches (typically multilevel) is required. Finally, when interpreting results from ESM data, one must not forget that these effects typically represent processes that occur continuously across individuals’ working lives; effect sizes thus need to be considered with this in mind.


  • Human Resource Management
  • Organizational Behavior
  • Research Methods


The goal of this review of experience sampling is to provide a detailed discussion of the basics of this methodology that are often not covered as thoroughly in the many other valuable reviews on the topic (Beal, 2015; Beal & Weiss, 2003; Dimotakis et al., 2013; Gabriel et al., 2019; Ohly et al., 2010; Reis & Gable, 2000; Weiss & Rupp, 2011). In so doing, this article is most directly targeted to those at a beginner-to-intermediate level of knowledge and interest with experience sampling. However, while much of this information is somewhat introductory, the goal is to discuss this methodology and some associated issues in a straightforward manner that even seasoned experience sampling researchers may still find useful.

There is certainly no better time for scholars to acquaint themselves with this method. For some, the goal may be to become a primary user to test research questions; to this point, some have argued that many theories may be better tested with this methodology (Beal, 2015; Dalal et al., 2014; Gabriel et al., 2019). For others, the goal may be to become a more informed reader and reviewer, which is important as well, given that the number of these papers being published in management and applied psychology is on the rise (McCormick et al., 2020; Podsakoff et al., 2019). In either case, this article should provide useful and valuable information.

What Defines a Study as Experience Sampling?

To begin, while experience sampling seems to be the most common name for this methodology, it does go by several other names (i.e., ecological momentary assessment, everyday experience sampling, event sampling, or daily diary method; McCormick et al., 2020). There are semantic reasons that make these terms not completely synonymous, but such differences are not germane to this discussion. Overall, the term experience sampling (ESM) seems to be the preferred label for this method. In terms of what constitutes an experience sampling study, there is no singular, defining characteristic. Rather, the consensus among scholars is that a study can be considered to use experience sampling when certain criteria are met (e.g., Beal, 2015; Beal & Weiss, 2003; Reis & Gable, 2000): repeated measures, a focus on transient (as opposed to stable) phenomena, and the examination of within-individual (co)variation. However, there are some important nuances pertaining to these criteria.

Repeated Measures

Experience sampling involves repeated measurement (i.e., participants complete multiple measures of the same construct[s] over a period of time). This is a necessary but not sufficient condition, as both longitudinal and latent growth designs also use repeated measures. This point is important because the data structures for these methods share similarities which may prompt reviewers to inquire as to why experience sampling, as opposed to one of these other methods, was used. This point will come up again in the section titled ‘Some Remaining Points’ near the end of the article, and so for now it will suffice to say these methods are not interchangeable, as each is focused on different research questions and makes different assumptions about the underlying phenomena of interest. Note also that saying experience sampling uses repeated measures leaves much unsaid about some specific aspects of the design of these studies—another topic that will be covered in more depth.

Transient Phenomena

Experience sampling studies are designed to examine how transient state(s), behavior(s), or experience(s) affect subsequent state(s), behavior(s), or experience(s), and ideally the mechanism(s) and/or boundary condition(s) for those effects. Thus, experience sampling is not generally employed to examine theoretically stable characteristics of a person (e.g., personality) or their situation (e.g., norms), though these may be of interest as boundary conditions or integrated into an experience sampling design for empirical (i.e., measurement) reasons.

In explicating this criterion, two points that Beal (2015, p. 385) emphasized in his list of the elements that define experience sampling should be highlighted. First, Beal noted that experience sampling captures “experiences as closely as possible to how they would naturally [emphasis added] occur.” While it is the case that most experience sampling research examines natural experiences, a strict reading may imply that studies manipulating an initial state might then not be seen as experience sampling. (For an interesting stream of research that manipulates daily states, see Jennings et al., 2022; Lanaj et al., 2019; Song et al., 2018). With that said, the distinction is somewhat esoteric, and so it is likely Beal would not disagree with this point.

Beal also noted that experience sampling should prioritize “concrete and immediate experiences over abstract or recalled experiences.” This emphasis was particularly prevalent in early work (e.g., Larson & Csikszentmihalyi, 1983). Yet, most examples of experience sampling research involve at least some degree of recall—it is difficult to complete a survey while actively engaged in workplace conflict! To this point, Reis and Wheeler (1991) asked participants to record aspects of social interaction following such an instance. This necessarily involves some degree of recall and reconstruction, as it relies on participants to initiate the recording procedure which cannot occur during the interaction and may not always be done immediately afterwards. Moreover, experience sampling studies often ask participants to report what they have felt, done, or experienced “last night,” “since arriving at work today,” or “since the last survey” (Lanaj et al., 2014; Puranik et al., 2021; Rosen et al., 2016). From there, it is more a difference in degree than type in terms of studies that ask about experiences “over the past weekend” (e.g., Binnewies et al., 2010) or “over the past week” (e.g., Matta, Sabey, et al., 2020).

A reasonable concern with this discussion is that it becomes unclear where to draw the line. In general, the consensus in the extant literature seems to be that “over the past week” is acceptable—and indeed may be necessary for some phenomena which have within-individual fluctuations but unfold over longer periods of time (Chawla et al., 2019; Da Motta Veiga & Gabriel, 2016; Schaubroeck et al., 2018)—but longer time periods can possibly be an issue. This is because such recollections rely on aggregations of experience which may omit details of common workplace experiences (e.g., small talk; Methot et al., 2021) and become tainted by affective processes (see Forgas, 1995). Beal (2015, p. 387) also noted recall can be “tinged by personality and other aspects of semantic memory (i.e., general and conceptual knowledge) rather than by episodic memory (i.e., knowledge of events and experiences).” Robinson and Clore (2002) have an excellent discussion of this issue to which readers should refer for more detail.

An important point from Robinson and Clore (2002) is the argument that people can reasonably recall experiences that happen to them over the period of a week, and indeed, it is common to frame questions as “over the past week” in non-experience sampling work as well (Belmi & Pfeffer, 2016; Koopmann et al., 2021; Rosen et al., 2014). Thus, while recall may be a limitation of an experience sampling study focused on “the past week,” the increasing prevalence of research adopting this approach suggests that it is seen as reasonable. Of note, however, is that periods of time longer than a week are likely increasingly problematic, though if the focus were on low frequency events, and/or something negative, salient, or otherwise novel, then perhaps the approach may be seen as justifiable (Baumeister et al., 2001; Schwarz & Clore, 1983; Simon et al., 2015).

Within-Individual (Co)Variation

A final element of experience sampling is its focus on within-individual (co)variation (Beal & Gabriel, 2019). This goes with the previous discussion in our section on ‘Repeated Measures,’ as by obtaining several reports of the same construct over time, researchers can examine the variance that occurs “within-individual.” As an example, using a 1–5 Likert scale, a person might report a level of “2” for positive affect one day, but a level of “4” the next day. This contrasts with the variance that occurs “between-individual”; for example, one person reports a level of “2” for their positive affect, while another person reports a level of “4.” In the first example, the variation in positive affect occurs across measurement periods within one employee (a person has a higher level of positive affect on some days relative to other days), while in the second, the variation occurs across employees within one measurement period (some people have a higher level of positive affect relative to other people).
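The distinction between the two sources of variation can be made concrete with a small calculation. The following is a minimal sketch in Python, not a recommended analysis: the employee labels and daily scores are hypothetical, and the function name `split_variation` is an assumption introduced only for illustration. It separates each person's average level (between-individual) from that person's day-to-day deviations around their own average (within-individual).

```python
from statistics import mean, pvariance

# Hypothetical daily positive-affect reports on a 1-5 scale (illustrative values only).
reports = {
    "employee_a": [2, 4, 3, 3],
    "employee_b": [4, 5, 4, 3],
}

def split_variation(reports):
    """Separate between-individual from within-individual variation."""
    # Between-individual: each person's own average across days.
    person_means = {p: mean(scores) for p, scores in reports.items()}
    # Within-individual: each day's deviation from that person's own average.
    within_dev = {
        p: [x - person_means[p] for x in scores] for p, scores in reports.items()
    }
    between_var = pvariance(list(person_means.values()))
    all_dev = [d for devs in within_dev.values() for d in devs]
    within_var = mean([d * d for d in all_dev])
    return person_means, within_dev, between_var, within_var

person_means, within_dev, between_var, within_var = split_variation(reports)
print(person_means)             # average level per person
print(within_dev)               # day-level deviations per person
print(between_var, within_var)  # variance of person means, variance of deviations
```

In this toy example the person means differ (between-individual variation), and each person also fluctuates around their own mean across days (within-individual variation). Actual experience sampling analyses typically partition variance with multilevel models rather than with raw descriptive statistics.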

The ability to examine within-individual (co)variation is often highlighted as a hallmark of experience sampling research and has been used to promote the importance of this research. To this point, Beal (2015, p. 385) emphasized the theoretical importance of experience sampling research, as most theories in which scholars are interested are inherently about understanding the “sequences of events and event reactions that play out within each person’s stream of experience.” Gabriel et al. (2019, p. 971) echoed this point, stating “theories are often specified in terms of how an event, perception, state, or behavior yields subsequent reactions (within-person phenomena), but are evaluated between-person (cross-sectional surveys).” This is nuanced but important.

Consider the job demands-resources model (Demerouti et al., 2001). Demerouti and colleagues described how the experience of environmental stressors mobilizes physiological and psychological processes in a person. This implicitly suggests that those processes would not be mobilized when the stressors are not being experienced. Note that there is nothing about this theory that prevents its testing in a between-individual sense (i.e., people who experience greater levels of environmental stressors are likely to experience activation of these processes); indeed, a theory’s applicability at multiple levels of conceptualization is a hallmark of its generalizability (Voelkle et al., 2014). But as written, this theory implicitly focuses on intra-individual experiences of a phenomenon. The same is true of control theory, which involves comparisons between a person’s current and desired states with effects on subsequent behavior (Carver & Scheier, 1981). Certainly, principles of control theory can and have been tested between-individuals, but to the extent that the reference is transient, then the comparison should be within-individual. Thus, experience sampling as a study design element is inextricably linked to the examination of within-individual variation.

There is an important caveat, however. Even though the measurement approach inherently relies on within-individual (co)variation, the research question need not necessarily involve examining the consequences of this variance. It is true that most studies employing experience sampling have research questions that focus on that (co)variation, but study design need not dictate the scope of the research question being asked. As to why one would conduct an experience sampling study if there is no interest in examining within-individual (co)variation, one answer invokes the previous discussion of recall bias.

Consider two research teams, both interested in examining the affective consequences of receiving help at work. Both teams decide to test this question between-individuals over a span of two weeks (i.e., comparing the weekly positive affect of employees who receive greater levels of help to the weekly positive affect of those who receive lower levels over the past week; a between-individual question). The first team employs a design wherein at Time 1 they ask employees to report how much help they received over the past week. One week later (Time 2), they ask employees to report on their affective state over the past week. This is a common design that can effectively test many research questions (e.g., Griep et al., 2021; Koopmann et al., 2021). However, it would not be unreasonable to question the mental processes by which the participants aggregated all of their episodic experiences with receiving help during that week (Robinson & Clore, 2002). That is, did the person focus on the most recent episode, the most salient, the first one that came to mind, some prototype of what an episode generally looks like, or an average of all of them?

The other team, recognizing potential limitations associated with recall, instead conducts an experience sampling study during those two weeks (asking how much help was received each day of the first week, and about the employee’s affective state each day of the second week). The responses are then aggregated to the weekly level, which arguably provides a more accurate view of the focal construct (Bolger et al., 2003; Dimotakis et al., 2013). On a related note, Schulte-Braucks et al. (2019) did a variation of this by aggregating daily reports of enacting illegitimate tasks and self-esteem to understand the relationship among these at the between-individual level. Bolger et al. (2003) also suggested another way to leverage the multiple reports obtained through experience sampling—by not examining average or typical levels of phenomena but rather their variation (i.e., to examine its level of consistency or variability). In this way, a unique between-individual construct can be created (Johnson et al., 2012; Matta et al., 2017).
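Both uses of repeated daily reports described above (aggregating to a typical level, and summarizing consistency versus variability) can be sketched in a few lines of Python. This is an illustrative sketch only: the participant IDs and scores are hypothetical, `summarize_person` is a name introduced here, and the sample standard deviation is just one possible index of variability, not necessarily the operationalization used in the studies cited.

```python
from statistics import mean, stdev

def summarize_person(daily_scores):
    """Collapse one person's daily reports into between-individual scores."""
    return {
        "level": mean(daily_scores),         # typical (average) level across days
        "variability": stdev(daily_scores),  # day-to-day inconsistency (sample SD)
    }

# Hypothetical daily reports of help received over one work week (1-5 scale).
daily_help_received = {
    "p01": [3, 3, 3, 3, 3],  # perfectly consistent
    "p02": [1, 5, 1, 5, 3],  # same average, highly variable
}

summary = {p: summarize_person(s) for p, s in daily_help_received.items()}
for person, scores in summary.items():
    print(person, scores)
```

The two hypothetical participants have the same aggregated level but very different variability, which is why variability can serve as a distinct between-individual construct.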

A History of Experience Sampling

Early Experience Sampling Research

Today, experience sampling is ubiquitous in management and applied psychology journals (Podsakoff et al., 2019); however, this was not always the case. The basic methodology actually dates to studies of employee mood in the early 1900s (Flügel, 1925; Hersey, 1932), but the positivist emphasis within psychology that accompanied the behaviorism movement over the next few decades (e.g., Schmidt et al., 1981; Wheeler & Reis, 1991) left little room for research on one’s inner experience or internal processes. Thus, it was not until the cognitive revolution that scholars rediscovered interest in the self-monitoring or self-recording of personal events and experiences. It was within this paradigm that Csikszentmihalyi and colleagues (e.g., Larson & Csikszentmihalyi, 1983) coined the term “experience sampling” in their investigations of flow. From there, usage of the method increased slowly in social psychology (e.g., Diener et al., 1984) before crossing into organizational work in a series of articles examining the effect of work and family factors on mood states (Alliger & Williams, 1993; Williams & Alliger, 1994).

Experience Sampling in Organizational Scholarship

Even after this initial crossover into management and applied psychology in the mid-1990s, it took another five years for experience sampling to reappear (Teuchmann et al., 1999; Weiss et al., 1999), followed by work from Fisher (2000, 2003; Fisher & Noble, 2004; Fuller et al., 2003) and other authors (Ilies & Judge, 2002; Miner et al., 2005), along with an in-depth methodological discussion from Beal and Weiss (2003). Looking back at the beginnings of what would become an explosion of research using experience sampling, it is instructive to think about why the method took off at that time.

Essentially all these manuscripts were focused on employee affective experience. Yet the more recent articles landed on fertile ground as they were situated within the “affective revolution” that was occurring in psychological and organizational scholarship at the time (Barsade et al., 2003, p. 3). That is, the 1980s–1990s were a turning point as affect moved from being viewed as an irrational, nuisance construct to a set of structured constructs that have an impact on organizational behavior (Fisher & Ashkanasy, 2000; Frijda, 1988). As a consequence, the transient and highly variable nature of affective experience was increasingly recognized as a crucial element of study design (Sheldon et al., 1996)—that is, something to be studied rather than controlled for or assigned to the error term of the model (Schmidt et al., 2003). Experience sampling was well suited to study this phenomenon both conceptually (the fluctuating yet potentially persistent character of affective phenomena and experience sampling were a good fit) as well as methodologically (as experience sampling can assess affective states as close to their occurrence as possible, thus avoiding or minimizing problems with retrospective recall methods).

At the same time, researchers were able to move away from the traditional paper-and-pencil diaries that had until then been used in experience sampling research (e.g., Alliger & Williams, 1993; Reis & Wheeler, 1991). To this point, most work involved giving participants numbered and labeled surveys (“diaries” in the parlance of the era) that were to be filled out at designated times or following specific events. These diaries would then be mailed back to the researchers. Then at the turn of the century, researchers began to “take advantage of new technology offered by personal digital assistants (PDAs)” (Beal & Weiss, 2003, p. 447). While the PDAs and their software were themselves quite buggy, this technology helped legitimize experience sampling as a method for collecting robust, time-stamped data.

The research discussed above was critical to demonstrating how experience sampling can make theoretical contributions to the affect literature (Miner et al., 2005; Schimmack, 2003). Yet while this is a large literature with connections to many phenomena, it still reflects a single area of inquiry. During that same period, however, several other manuscripts were published that showed the potential for experience sampling to contribute to other distinct and different literatures. For example, Sonnentag (2003) used experience sampling to highlight the importance of daily recovery, and several teams used it to examine aspects of the job demands-control model (Butler et al., 2005; Daniels & Harris, 2005; Elfering et al., 2005). In total, the work discussed revealed the utility of experience sampling for the research questions asked in organizational scholarship.

Experience Sampling Goes Mainstream

An inflection point for experience sampling seemed to occur in 2006. This point is illustrated by the exponentially increasing line depicting the publication year of ESM studies in Figure 1 of a recent review by Podsakoff et al. (2019). In addition, 2006 saw the publication of an experience sampling paper (Ilies et al., 2006) in the Academy of Management Journal, an outlet that both seeks to publish research applicable to a broad cross-section of scholars and sets a high bar for the expected contribution of manuscripts. This seemed to be a tacit acknowledgement that experience sampling was seen as a robust methodology that could be used to make important contributions in a wide number of literatures. From here, the number of papers using experience sampling increased exponentially as scholars used the method to examine the work–home boundary (Ilies et al., 2007), employee deviance (Judge et al., 2006), workday breaks (Trougakos et al., 2008), and work engagement (Bakker & Xanthopoulou, 2009). While these early studies still tended to rely on either PDAs or pencil-and-paper surveys, scholars soon began transitioning to web-based survey platforms (e.g., Loi et al., 2009; Wanberg et al., 2010). Thus, with this as the backdrop to the current state-of-the-science, the remainder of this article is devoted to examining a series of important topics focused on demystifying experience sampling from an operational standpoint.

Important Topics for Experience Sampling

Research Questions That Can Be Investigated With Experience Sampling

As discussed in the section ‘Within-Individual (Co)Variation,’ a key feature of experience sampling is that it permits researchers to examine within-individual covariation among constructs. Although researchers can aggregate those reports to examine between-individual covariation, experience sampling studies typically investigate research questions that are within-individual in nature. Also, while these studies can be used to examine a phenomenon over time periods of a week or month (Matta, Sabey, et al., 2020; Simon et al., 2015), most tend to focus on daily effects. Thus, experience sampling is best used for research questions such as explaining why individuals who experience higher (or lower) levels of an event, state, or behavior on a given day tend to report higher (or lower) levels of a subsequent state or behavior, compared to other days. Note that this differs from a between-individual study which would ask a similar question but makes other people the reference for comparison (instead of the same person on a given day).

For this reason, researchers must have theoretical reasons to think the phenomenon of interest varies within-individuals in the chosen span-of-time (e.g., daily or weekly). This assumption is generally met relatively easily for affective states, job related attitudes, and [especially positive] behaviors. To this point, some affective states tend to be relatively short-lived (Cropanzano et al., 2003), attitudinal judgments can be influenced by momentary experiences (Hastie & Park, 1986), and people have multiple motives or goals that underlie behavior which makes that enactment subject to within-individual variation in the particular antecedent conditions (Bolino et al., 2012). As it pertains to more negative (deviant) behaviors, there is plenty of research to suggest these do vary within-individually on a daily basis (e.g., Koopman et al., 2021; Mitchell et al., 2019; Yuan et al., 2018). However, base rates for deviant behavior are notably lower than for more positive behavior such as task performance or citizenship (Hershcovis, 2011; Robinson et al., 2014), so researchers studying deviant behavior may wish to consider longer time frames (as long as it fits the research question) to ensure that they can capture the necessary within-individual variation.

Experience Sampling Study Design

There are many important study design factors to consider for an experience sampling study. Some of these are the same as would be relevant when running any study (e.g., choosing an appropriate sample, administering the surveys, etc.). However, even some of these factors can have added complications when conducting an experience sampling study, and then there are some novel considerations that arise as well. Note that there are few rules when it comes to experience sampling study design—rather, they are more like guidelines, and there are both exceptions and tradeoffs to each decision.

Study Duration

To reiterate, experience sampling employs repeated measures. This means participants complete the same measures multiple times over a given period. For a daily experience sampling study, that period is often every day for one to three weeks. There are studies that have run for four weeks, but they are taxing on participants (Ilies & Judge, 2002; Koopman, Rosen, et al., 2020; McClean et al., 2021). As we discuss in the subsequent paragraph, two weeks seems accepted as the prototypical duration (e.g., Dimotakis et al., 2011; Gabriel et al., 2021; Scott & Barnes, 2011). However, there are some reasons to prefer three weeks if possible (e.g., Koopman, Lin, et al., 2020; Koopman et al., in press; Lanaj et al., 2016; Yang et al., 2016). Importantly, this decision about how many weeks has conceptual and empirical implications.

Conceptually, at issue is that experience sampling is a way to assess participants’ “lived-through” experience (Weiss & Rupp, 2011, p. 83). To do so, the study must be conducted (a) during a “normal,” “typical,” or an otherwise representative window of each participant’s life, and (b) for long enough to capture a generalizable picture of the same. These are obviously conditions that cannot be directly measured. For the former, typically the authors must make an informed assumption (e.g., if the sample is U.S. accountants, then April may not be a good time for a study, though even here that would depend on the phenomenon of interest). For the latter, while authors generally do not explicitly justify the duration of their study, when they do, it typically involves citing Reis and Wheeler (1991) who stated that “we have found that 1 to 2 weeks is the optimal record-keeping duration” (p. 286) and later said that “the 2-week record-keeping period is assumed to represent a stable and generalizable estimate of social life” (p. 289). Note that these authors were primarily interested in socializing patterns among college students, and not work interactions, thus there is nothing inherently definitive or authoritative in that particular assertion. This is one reason why a three-week design might be beneficial, as this is a longer window of participants’ lives. Similarly, this is why one-week designs might be limited, though this does not mean such designs are not publishable (e.g., Bakker & Xanthopoulou, 2009; Matta, Scott, et al., 2020; Sonnentag, 2003).

Empirically, aside from statistical power (i.e., the more weeks of data collected, the more data points for the analysis), there is another consideration: the analysis of experience sampling data typically involves estimating person-level intercepts and slopes (i.e., random effects). The estimation of these effects is heavily dependent on the number of within-individual data points. For a more complete explanation of this issue, please consult Snijders and Bosker (2012). For models of even moderate complexity, the number of random effects to be estimated can be substantial. Thus, having as high an average number of such points as possible is desirable (hence the added value of a third week of data collection).
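A rough planning calculation can illustrate why the number of within-individual data points matters. The sketch below is a hypothetical planning aid under stated assumptions, not a power analysis: it assumes five workdays per week, an assumed 80% response rate, and an unstructured covariance matrix among the random effects (in which q random effects imply q(q + 1)/2 unique variances and covariances). The function names are introduced here for illustration.

```python
def covariance_parameters(n_random_effects):
    """Unique variances and covariances in an unstructured q x q covariance matrix."""
    q = n_random_effects
    return q * (q + 1) // 2

def expected_days_per_person(n_weeks, workdays_per_week=5, compliance_rate=0.8):
    """Expected usable day-level observations per person (assumed figures)."""
    return n_weeks * workdays_per_week * compliance_rate

# A random intercept plus three random slopes gives four random effects.
n_params = covariance_parameters(4)

two_weeks = expected_days_per_person(2)
three_weeks = expected_days_per_person(3)

print("variance/covariance parameters:", n_params)
print("expected days per person, two weeks:", two_weeks)
print("expected days per person, three weeks:", three_weeks)
```

Under these assumptions, a third week raises the expected number of daily observations per person by half, while the number of random-effect parameters to be estimated stays the same.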

Surveys Per Day

Beyond the number of days, researchers must also decide how many surveys per day to require of participants, and when to send out such survey(s). This is a critical decision. Notwithstanding the logistical elements this decision influences (e.g., more daily surveys generally equate to greater cost and an increase in the difficulty of running the study), it also affects how well the data can test the research question. As an example, if the question involves how start-of-work states influence behavior across the day, this design should probably involve at least two daily surveys. Very broadly speaking, it is useful to think of the day in four distinct windows or “chunks”: start-of-work, mid-work, end-of-work, and after-work. These partitions represent theoretically relevant and operationally well-defined periods of time.

Most commonly, experience sampling studies seem to sample from either two or three of these windows. Administering only one survey per day is risky—if the theoretical model includes mediation, such a design largely mimics a same-time/same-source study, which is conceptually and empirically problematic (Kozlowski, 2009). This is not to say examples do not exist (e.g., Matta et al., 2017; Menges et al., 2017); however, there were arguably mitigating circumstances in both. Matta and colleagues’ contribution focused on the moderation of the first stage effect, whereas Menges and colleagues obtained an objective measure of each participant’s daily performance. It is also common to not exceed three surveys per day, but there are recent studies that—for reasons specific to their research questions—used either four (Anicich et al., 2021) or five (Frank et al., in press) daily surveys. The key here is keeping the surveys short and making sure that the design is relevant to the phenomenon. Also note that for studies using a weekly or longer design, typically participants will do only a single survey and therefore this survey will need to have all study measures on it. If testing mediation with a weekly design, it will be important to separate at least one of the measurement periods from the other two. Thus, it is common to collect at least five to six weeks of data (Matta, Sabey, et al., 2020; Schaubroeck et al., 2018).

Frame of Reference for Daily Measures

This point follows directly from the discussion on the number of surveys, as the time frame over which participants are asked to respond must be in alignment with when the surveys are administered. Participants are asked to consider their experiences, states, or behavior over some period of time. Experiences reflect some environmental condition or an event that has happened. Typically, these are measured by framing the items as “since arriving at work today,” “since the previous survey,” “today at work,” “last night,” “over the past week” (if using a weekly design) and so on. It will usually be some clear and definable period of time that creates explicit temporal boundaries (e.g., Rosen et al., 2016). Behaviors are often worded the same. Indeed, because behavior is often a dependent variable, temporal boundaries are critical to arguing that the actions in question follow the antecedent experience or state. States reflect a person’s cognitive or affective circumstances. Often these are framed as “right now” (e.g., Johnson et al., 2014), but it is not uncommon to frame them similarly as with experiences (e.g., Gabriel et al., 2018). Notably, Puranik et al. (2021) found a correlation of .86 between the same measure framed as “right now” versus “since arriving at work.” While this suggests the wording of state measures may not be much of a concern from an empirical standpoint, this is very important from a conceptual standpoint.

Consider a model such as that from Methot et al. (2021) in which the independent variable (small talk, an experience) is measured as “since arriving at work” and the mediator variable (positive emotion, an affective state) is measured as “at the present moment”; both are collected on the same survey. Theoretically, these authors argued that small talk will be associated with subsequent positive emotion. While the authors could likely also have worded positive emotion as “since arriving at work,” their choice of wording should more effectively separate the independent variable and mediator in the minds of participants (i.e., an experience they had earlier and a state they feel right now). In fact, had the authors worded positive emotions as “since arriving at work,” they might have been less able to respond to a criticism that positive emotions could lead one to engage in small talk, as the measurement time periods would have been concurrent. This issue is arguably alleviated with the wording they adopted, as it would not make sense conceptually to argue that how a person feels “right now” could have influenced their experiences over the past several hours. Similarly, adopting “since arriving at work” for both measures leaves researchers more open to omitted variable problems (maybe there was a work event that led to both higher positive emotions and more small talk). While changing the wording does not, of course, solve the omitted variable problem, it does make it somewhat less likely as the change in temporal frame would likely attenuate the relationship of the omitted variable to at least one of the other two.

Notwithstanding these arguments as to why this approach may be theoretically valid (particularly when worded as the authors did and presuming that the dependent variable is measured on a separate survey or from some alternate source), researchers may still experience some pushback from reviewers who could argue that the measures should have been temporally separated. Yet beyond imposing an additional burden on researchers (and study participants), this design aspect can be potentially damaging to the inferences that can be drawn. Consider the following model: an employee’s enacted helping is associated with positive affect. Helping is measured mid-work at 12 p.m. with the stem “since arriving at work today” (thus covering behavior during the past three to four hours). Positive affect is measured at the end-of-work at 4 p.m. as one’s state “right now.” Thus, this model proposes that one’s behavior between 8 a.m.–12 p.m. affects how that person feels at precisely 4 p.m. Yet note that this design does not include any measurement of what the employee felt or did between 12 p.m. and 4 p.m. This design makes it hard to rule out whether some other experience could have occurred in that intervening time period that was more proximally related to the reported positive affect, and puts the burden on the researchers to be prepared to rule out a myriad of other alternative explanations or contributing factors. And importantly, the greater the within-individual variability in the outcome variable, the more severe this problem would be, as it would illustrate greater susceptibility of this construct to daily antecedent constructs. In contrast, had the authors measured positive affect at 12 p.m. with the same “right now” stem, then the phenomenon and subsequent state would be better linked—this is precisely how these variables were measured by Koopman et al. (2016).

However, this discussion should not be read as indicating that the measurement of an intervening state should always be framed as “right now.” Instead, there is an alternate scenario that is also common. Imagine a model which proposes negative affect is associated with enacted incivility. The authors collect both negative affect and enacted incivility on the same end-of-work survey. If negative affect is measured “right now,” and enacted incivility is measured “since the previous survey,” there is a conceptual problem. Temporally, what occurred “since the previous survey” occurred before “right now,” yet empirically these are modeled in reverse. Perhaps authors could point to the high correlation reported by Puranik et al. (2021) to argue that the relationship would have been the same, but it would be better to not be in this situation at all. This leads to a recommendation for study design purposes. When an experience or behavior is thought to affect a subsequent state (and both are measured on the same survey), then it may be better to measure that experience or behavior as retrospective over a previous period of time and that state as “right now.” If a state is thought to affect a subsequent behavior or situation (and both are measured on the same survey), however, then it may be better to measure both as retrospective over a previous period of time.

Alternate Scheduling Approaches

Note the discussion thus far has been about schedule-based (or interval-contingent) sampling. There are alternatives to this (see Dimotakis et al., 2013), but they have become less common over time. For example, surveys can be sent (semi)randomly through the day, but this can create problems particularly for capturing behaviors or experiences, as the timing of surveys is often chosen to ensure that participants have had the opportunity to have the experience or enact the behavior in question. Another alternative is trigger-based (or event-contingent) scheduling, in which no fixed number of surveys is given; rather, individuals are asked to fill out a survey any time a triggering event occurs. Such designs may produce demand effects and lead to hypothesis guessing, as the particular event needs to be explained to participants so they can complete a survey when the experience occurs. However, it is worth noting that technological advancements such as location tracking or heart rate monitoring can be used to trigger surveys, which may allow for increasingly sophisticated study designs such as sending a survey after interactions with a supervisor or coworker, or when a person is experiencing physiological strain.

Resources (Monetary)

Perhaps one of the most significant constraints associated with experience sampling studies is the resources required to administer them. Monetary resources are often what people think of first, and these costs can be significant. It is not uncommon to pay participants up to $60–$70, and Gabriel et al. (2019) noted that the average study tends to be about 83 people (after attrition, which suggests that researchers should ideally aim for at least a starting sample size of 100–125). Thus, such a study may cost between $4,980 (83 participants at $60) and $8,750 (125 participants at $70), though it is common with experience sampling to pay “per survey,” so not everyone earns the top-line number.
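The cost range above can be reproduced with a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, and the per-survey figures (a $2 rate over 30 surveys) are assumptions chosen so that the maximum payout equals $60, not values taken from any cited study.

```python
def top_line_cost(n_participants, max_pay):
    """Upper-bound cost if every participant earns the maximum payment."""
    return n_participants * max_pay


def per_survey_earnings(surveys_completed, rate_per_survey):
    """Earnings for one participant under a pay-per-survey scheme."""
    return surveys_completed * rate_per_survey


# Range from the text: 83 people at $60 up to 125 people at $70.
low = top_line_cost(83, 60)
high = top_line_cost(125, 70)
print(low, high)

# Assumed scheme: 30 surveys at $2 each gives a $60 maximum.
# A participant who completes 24 of the 30 surveys earns less than that.
print(per_survey_earnings(24, 2))
```

Because payment is tied to completed surveys, the realized budget will usually fall below the top-line figure.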

However, this presumes the study will only need self-reports. If other-reports are to be employed as well (e.g., examining spillover or crossover between spouses), then the cost can increase. If the relationship of interest is unidirectional (i.e., an employee whose experiences spill over to their spouse), then it is common to ask spouses only to complete a single daily survey each evening (for which they may be paid up to $25–$30; Lanaj et al., 2018). But an alternative research question may acknowledge the working status of both the employee and spouse (e.g., explicitly focusing on dual income couples and asking the spouse to respond to one or more surveys at work). In this case, both individuals would be recruited as primary participants in the study, a design that necessitates further discussion.

To begin, a question that arises is deciding what the unit of analysis is (i.e., the individual or the dyad). If it is the dyad, then analytically this design is essentially identical to the one where spouses completed only a single evening survey, but instead of paying spouses $25–$30 for those reports, they will have been paid $60–$70. If it is the individual, the design doubles the top-line number, as well as the sample size. This is not inherently problematic if the budget can accommodate it; indeed, many relationships that occur daily are small, so the added statistical power may be beneficial. In fact, this thinking can be leveraged to increase sample size because a problem with experience sampling can be finding enough people to participate. If a sample is found, then recruiting a working spouse (or even a coworker) to participate as well is an effective way of increasing the sample size. Note that the employee and other (spouse or coworker) likely cannot be treated as fully independent in the analysis, as they are clustered in their own dyad (i.e., both people can be an “employee” analytically, but as a spousal couple or coworker dyad, they are not entirely independent). This can be addressed analytically with a cross-nested approach or clustered standard errors that account for the nonindependence (for an example, see Yoon et al., 2021).

Resources (Nonmonetary)

Resources reflect more than money, however—indeed, the above discussion also reveals that accessing participants willing to do an experience sampling study is critical. Some authors have used university staff as participants (e.g., Ilies et al., 2010; Koopman, Rosen, et al., 2020). Others use social media to recruit either from their own personal and professional networks (e.g., Butts et al., 2015), or from particular interest groups (e.g., Gabriel et al., 2020). For business schools, a fruitful source is part-time MBA students (e.g., Rosen et al., 2021), particularly for questions involving leadership processes (e.g., Johnson et al., 2014). Plus, using an MBA sample has the added benefit of reducing the financial burden of the study. Beyond that, industry contacts can be quite helpful (e.g., Liu et al., 2017; Spieler et al., 2017; Tang et al., 2022; Uy et al., 2017).

One resource that may be overlooked is having a survey platform to host the study. Many universities have contracts with platforms such as Qualtrics or SurveyMonkey, but even these can vary based on how many surveys can be hosted, how many responses can be obtained, and other relevant considerations. Any limitations of the university contract can create barriers to running these studies—an issue that compounds if the university does not have a contract at all. There are also specific platforms geared towards experience sampling studies such as Expiwell, Piel Survey, or Movisens. However, discussion of these can blur the line between monetary and nonmonetary resources. These platforms can be expensive but can simplify some aspects of running an experience sampling study and allow researchers to collect more than just survey data.

Study Waves

The vast majority of experience sampling studies are conducted in a single wave, which greatly simplifies the administration of the study. But this is not an operational requirement—large experience sampling datasets can be compiled from separate “mini” ESM studies conducted serially with different groups of participants (i.e., the participants are different across studies, but they are participating in the same “study,” albeit at different chronological points in time). For example, the first set of fifty participants complete their surveys during the first three weeks of the study; the second set of fifty participants complete their surveys from the third to fifth week of the study; the third set of fifty complete their surveys from the fifth to seventh week of the study, and so forth. This would be most useful when researchers want to assess not only the “typical” experience of a group of individuals, but how the nature of this relationship might change in response to an ongoing phenomenon. Fu et al. (2021) adopted this technique (what they termed a shingled ESM approach) using four waves of experience sampling data with a one week overlap in between (so Wave 1 was weeks 1 to 3; Wave 2 was weeks 3 to 5, and so forth) in their examination of employee stress responses to the ongoing COVID-19 pandemic.
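The overlapping wave structure described above can be written down as a small schedule generator. This is a minimal sketch under the parameters stated in the text (three-week waves with a one-week overlap); the function name and its arguments are hypothetical and are not drawn from Fu et al. (2021).

```python
def wave_weeks(n_waves, wave_length=3, overlap=1):
    """Return (first_week, last_week) for each wave, with weeks numbered from 1."""
    step = wave_length - overlap  # weeks between consecutive wave starts
    schedule = []
    for i in range(n_waves):
        first = 1 + i * step
        last = first + wave_length - 1
        schedule.append((first, last))
    return schedule


for number, (first, last) in enumerate(wave_weeks(4), start=1):
    print(f"Wave {number}: weeks {first} to {last}")
```

With four waves, the data collection spans nine calendar weeks even though each participant is surveyed for only three.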

Use of Other-Reports

There may be good reasons to use other-reports when feasible. For one, introducing another source is a way to alleviate some common-source variance concerns associated with a common rater (Podsakoff et al., 2003)—though, we have more to say on this point in the subsequent paragraphs. Additionally, others can be used to report on some stable aspect of the job that is relevant to the focal employee’s responses (e.g., as a between-individual moderator; Gabriel et al., 2019). Indeed, others have been argued to potentially be a more valid source of measuring employee personality (e.g., Oh et al., 2011). A practical rationale is that it can be desirable to show that the effects being hypothesized are observable in the “real world” (e.g., that employees are noticeably engaging in citizenship, that task performance is higher, etc.). Thus, for scholars used to seeing other-reports in papers on task performance (e.g., Le et al., 2011), citizenship (e.g., Halbesleben et al., 2009), deviant behavior (e.g., Deng et al., 2018), work–family conflict (e.g., Wilson et al., 2018) and so on, it may appear questionable to see experience sampling papers published using self-reports (e.g., Guarana et al., 2021; Liu et al., 2015; Mitchell et al., 2019; Rosen et al., 2021; Sherf et al., 2019). This point is understandable.

Yet it is not constructive to suggest that experience sampling studies that do not include other-reports are lacking in rigor. Rather there are theoretical, empirical, and logistical issues that are often ignored in favor of what at times seems to be a broad critique of experience sampling in general rather than a pointed focus on whether other-reports are appropriate for the phenomenon, research question, and sample at hand. One response to this criticism that has been addressed in prior reviews of experience sampling is that while other-reports are often used in between-individual research to reduce common-source variance attributed to response tendencies, this issue is not as germane for experience sampling data (for an explanation of this point, see Beal, 2015; Gabriel et al., 2019). Beyond this, there are other features of experience sampling which may make the use of other-reports deficient that have not been fully elucidated to this point.

To illustrate, consider the following. Imagine you were asked to report on how much a colleague helps others, in general. You could likely answer this question easily, based on both your own experiences with that person and what you have heard from other colleagues. This is precisely how such a question is intended to be answered. But now extend the thought experiment. Think how well you could answer that question about how much this colleague helps others with the following prompts: over the past six months; over the past one month; over the past one week; over the past day; and over the past four hours. With each successive narrowing of the time frame, you may have found it slightly harder to accurately assess your colleague’s helping behavior as your opportunities to observe that person correspondingly narrowed. Perhaps you have been with that person all morning and could easily report on their behaviors either today or in the past four hours (as would be typical prompts for an experience sampling study). Yet ask yourself—could you have accurately assessed that behavior each day for the past five to fifteen days? Likely the answer is no, as you probably have separate offices and are often engaged in noninterdependent work. Now multiply that by the 80–100 or so employees in the typical experience sampling study and the problem with other-reports should begin to become clear. Simply put, a given coworker’s ability to assess an employee’s behavior over short periods of time each day for one to three weeks in a manner that adequately covers the construct’s content domain is questionable.

This has the potential to be problematic because study participants often want to be helpful (or are reluctant to not at least attempt to provide information as they wish to be paid). Thus, even if they have not observed the employee whose behavior they are supposed to report, they may still answer the questions. For example, for measures of helping, a participant could write something to the effect of “while I haven’t seen Jane today, she is the most helpful person in the office,” and rate each of the items as a “5.” The reverse happens for deviance (i.e., “Jane would never do these things”) along with ratings of “1” for each item. It goes without saying that, as the coworker has not observed Jane today, those reports may or may not be accurate for that day—Jane could have conformed to these typical modes of conduct or had an “off” day that resulted in her being less helpful than usual (and perhaps cyber-loafing more, or being more uncivil than she usually is). Yet if such responses are present in the data, this can make the measurement of the construct invalid (Gabriel et al., 2019).

To some extent, the situation described above can be mitigated by asking coworkers about the extent to which they have observed the employee that day, as these cases can perhaps be eliminated from the data. Open ended questions can also be of help here, as illustrated in the example about Jane. However, even this can be unclear because it is difficult to define all the ways in which people become aware of a person’s behavior. Perhaps the coworker did not explicitly observe the focal employee’s helping or deviant behavior, but a third coworker mentioned it. This may be a valid case that could otherwise get screened out. There is also nuance when a coworker has had low to moderate opportunity to observe the employee. At this point, it may be quite difficult to ascertain whether those opportunities correspond with the employee having an opportunity to enact the particular behavior (e.g., helping). For example, those observations could come while the two people were on break, or while the employee was being helped with a task by the coworker. Thus, while the coworker has observed the employee for some time, there was no opportunity for the employee to enact helping behavior of their own. If that coworker responds with “1s” for those questions, the report’s accuracy is unknown. A final concern here is that “opportunity to observe” is not synonymous with “ability to recall.” It is fair to question how much attention people really pay to the behaviors of others at work. Just because a coworker has been in the presence of the focal employee for four hours does not mean that the employee’s behaviors during that time are salient when the coworker is asked to report on them.

Note that opportunity to observe can also be confounded with the research question. Consider a study on how workload influences subsequent behavior. If an employee has high workload, they may stay isolated in their office, which may reduce opportunities to be observed by their coworker. There are other potential problems here as well. For example, citizenship behavior has been argued to be driven, in part, by impression management motives (which also may vary daily; Klotz et al., 2018). Thus, whether one’s citizenship was visible to others on a given day may be driven by unmeasured variables. Conversely, consider deviant behavior, which has been shown to be often unobservable by others (Carpenter et al., 2017). If the designated other-reporter has not seen this behavior from the focal employee, then it becomes difficult to know whether that was because the employee did not engage in that behavior, or because the reporter simply did not see it, as this behavior is often not observed. For these reasons, using an other-report has the potential to introduce considerable heterogeneity into the responses that can result in both Type I and Type II errors.

The best scenario for a coworker report is a situation in which the researchers can know for certain that the employees and coworkers have plenty of opportunities to observe each other. For example, Trougakos et al. (2015, p. 230) noted that coworkers should work in “close physical proximity” to the employee (see also Tang et al., 2020). However, such conditions are rare, and reviewers should not expect that authors will have this level of knowledge and control over the ability of coworkers to provide accurate reports. Note also that this situation is likely even more fraught for supervisor reports, as employees and supervisors may often go multiple days without direct interaction. However, this is likely less of a concern regarding reports from one’s spouse/partner with whom they cohabitate, as these individuals do often interact for multiple hours each day (though this is not a given and can be assessed—see our upcoming discussion on this point in the ‘Experience Sampling Study Administration’ section). Obviously, spouses cannot report on employees’ work behaviors, but if behavior at home might proxy (i.e., if an employee’s helpfulness at home can test the mechanism nearly as well as helpfulness at work), this could be seen as appropriate and thus high in both utility and rigor.

In sum, authors are encouraged to obtain other-reports (coworker, supervisor, spouse, etc.) when possible and appropriate. However, there are aspects of experience sampling that not only make these reports difficult to collect, but also may make those reports less appropriate. Thus, a measure of skepticism regarding their accuracy is generally warranted and any perspective that experience sampling is less robust if other-reports are not obtained should be balanced against the theoretical and logistical challenges of obtaining those reports. Ultimately, it is the research question and study context that should drive empirical decisions, and it is often a mistake to discount the one person present each time a behavior in question was enacted—the actual focal employee. Often, they are the only person capable of answering particular questions (for exemplars of this argument, see McClean et al., 2019; Puranik et al., 2021). But even when behavior is potentially observable to others, the discussion to this point reveals the myriad theoretical and logistical complications that arise from using other-reports. For this reason, reviewers and editors are encouraged to recognize that adding other-reports (a) adds to the cost of what is already an expensive data collection method, (b) may not solve an empirical problem, and (c) provides data that may be difficult to interpret, is potentially construct-invalid, and may introduce additional sources of error. Other-reports should be seen as a more conservative test of a hypothesis when present (if they are appropriate to the question and context), but their absence should not, by default, be used as evidence of lack of rigor—rather, study design should be evaluated holistically based on its ability to address the research question.

Experience Sampling Study Administration

The next section presents some very detailed logistical aspects of experience sampling that do not necessarily make it easier to run these studies but should improve data quality.

Sample and Screening

Sample selection issues are not unique to experience sampling—no matter the study design, the sample must be appropriate to the phenomenon of interest and there needs to be an expectation that there will be sufficient variance to analyze. However, some of these issues can become trickier with experience sampling. Consider a study on surface acting towards coworkers. First, if the recruitment process is relatively open (e.g., recruiting through some type of list-serv, social media, alumni network, etc.), it is possible that employees who tend to work on solitary tasks and only interact with coworkers sparingly during the week could sign up. These individuals may be acceptable in a between-individual study that asks the extent to which participants surface act towards coworkers “in general” or even “over the past week,” as they may have sufficient opportunity over those time frames to engage in that behavior, and some employees may be more likely to surface act than others (thus, there should be sufficient between-individual variance). However, an experience sampling study could be problematic, as there may be multiple days in which employees do not interact with their coworkers.

The above scenario can easily occur, as participants can be dispersed among different jobs, organizations, and geographies. Thus, sign-ups should be screened. For example, it is possible that the relationships of interest could differ for workers who are employees of an organization versus independent contractors. If this is a potential concern, independent contractors should be screened out in the sign-up process, or at minimum this information should be collected and examined during data analysis. The same goes for factors such as frequency of coworker or supervisor interactions. Other issues to consider include whether a person has one versus multiple supervisors, whether people work in-person versus remote versus a mix, how many hours per day employees work (and whether this is stable or variable), and so on. Researchers must be thoughtful about the sample and how well it fits with the design and research questions. Otherwise, participants can be frustrated by repeatedly responding to inapplicable questions, and researchers will spend money, effort, and time to get data that cannot (or should not) be used.

The point about initial screening also applies to the recruitment of the person providing other-reports (if applicable). If the study design involves a coworker-report, this is the time to emphasize the necessary characteristics of this person (e.g., someone the employee sees or talks to on a regular basis, someone on their work team, their most frequent collaborator at work, etc.). Not all employees will follow these requirements in choosing that person, but this is still a useful place to start. Similar wording may be relevant if asking for supervisor-reports, as employees might have more than one person whom they might consider a supervisor. Whoever the source of the other-report is, researchers should be the ones to send the recruitment email instead of providing a link for the employee to pass along, as this eliminates an opportunity for the employee to simply do the survey themselves as the designated other.

Sample Information

At this point, it is important to ensure participants’ schedules fit the design of the study, though note that this part is less important if participants are known to have uniform schedules. That is, researchers should confirm that participants will be where they “should” be when surveys are sent (i.e., at work if the survey should be completed at work, or at home if the survey should be completed at home). There are different ways in which this information can be both collected and implemented in the study design, and these various options are well-represented in the extant literature. It may be enough to ask participants to confirm they work typical hours (i.e., arriving at work between 6 a.m.–10 a.m., leaving work between 3–7 p.m.). Alternatively, participants can be asked to indicate their typical schedule in discrete windows (i.e., asking them if they arrive between 6 a.m.–6:29 a.m., 6:30 a.m.–6:59 a.m., 7 a.m.–7:29 a.m., and so on). This may seem hyper-specific, but there are some benefits (see the next section on ‘Survey Planning’) though this does not imply that studies that do not take this approach are in any way flawed.
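As an illustration of the discrete-window option, the short Python sketch below maps a reported arrival time onto a half-hour window of the kind listed above. The function name and the 24-hour label format are assumptions made for this example.

```python
def arrival_window(hour, minute):
    """Return the half-hour window (24-hour clock) containing the given arrival time."""
    start_minute = 0 if minute < 30 else 30
    end_minute = start_minute + 29
    return f"{hour:02d}:{start_minute:02d}-{hour:02d}:{end_minute:02d}"


print(arrival_window(6, 10))
print(arrival_window(6, 45))
print(arrival_window(7, 0))
```

Storing the window label rather than a free-text answer makes it easier to sort participants into send-time groups later.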

Expectations should be reinforced when obtaining this information (i.e., being very explicit about the fact that participants will be asked to complete one or more surveys per day for a number of days). To this point, it may be helpful to explain what each daily survey is doing (e.g., “Daily Survey #1 is intended to capture your experiences before starting work for the day,” “Daily Survey #2 is intended to capture your experiences over the first half of your workday,” etc.). Explaining the purpose of each survey may help participants understand why the surveys are being sent at different times, and could even motivate them to complete their surveys, as it treats them as partners in the data collection process (Gabriel et al., 2019). If the design entails surveys to be completed at home each evening, participants should be asked if they prefer a different email address for that survey, as not everyone checks their work email at home.

Survey Planning

Some experience sampling designs send surveys with relatively broad availability windows (e.g., 6 a.m.–10 a.m., 10 a.m.–2 p.m., 3 p.m.–7 p.m.). This design is easy to implement, but there are potential drawbacks. Imagine two participants (one of whom arrives at 7:30 a.m. and the other at 9 a.m.) who each complete that first survey at 9:30 a.m. If that survey intends to capture experiences that happen early in the workday, then this worked for the first participant, but not the second. However, this can be screened for by asking participants when they arrived at work, which allows it to be accounted for during data analysis (i.e., controlling for time at work, or creating a decision rule, say one hour, and excluding cases in which the participant had been at work for less time than that). Another potential problem with this design arises if a participant does the first survey at 9:30 a.m. and the second survey at 10:15 a.m., as not much time has passed.
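The decision rule mentioned above can be expressed as a simple filter. The following Python sketch applies a one-hour minimum to the two participants in the example; the function names and the calendar date are arbitrary assumptions, and the threshold itself is a choice for researchers to justify.

```python
from datetime import datetime


def minutes_at_work(arrived, completed):
    """Minutes between arriving at work and completing the survey."""
    return (completed - arrived).total_seconds() / 60


def keep_case(arrived, completed, min_minutes=60):
    """Apply the decision rule: keep only cases with enough time at work."""
    return minutes_at_work(arrived, completed) >= min_minutes


# Both participants complete the first survey at 9:30 a.m. (date is arbitrary).
completed = datetime(2024, 3, 4, 9, 30)
early_arrival = datetime(2024, 3, 4, 7, 30)
late_arrival = datetime(2024, 3, 4, 9, 0)

print(keep_case(early_arrival, completed))  # two hours at work
print(keep_case(late_arrival, completed))   # thirty minutes at work
```

The same `minutes_at_work` value could instead be entered as a control variable rather than used to exclude cases.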

An alternative is to create semi-customized schedules for each participant. This requires detailed information about each participant’s day (i.e., it would be important to ask the hyper-specific questions noted in the ‘Sample Information’ section). Also, this requires recognizing that each participant’s day is not necessarily the same (i.e., people may work four days per week, have standing meetings Tuesday at 2 p.m., or have an appointment next Friday at 9 a.m.). Thus, beyond getting the information about each person’s typical schedule, it would be necessary to ask participants about idiosyncratic conflicts during the study window. With this information, it is possible not only to place participants in groups that receive their survey links at different time points (i.e., Group 1 receives their link at 6:45 a.m., Group 2 receives their link at 7:15 a.m., and so on), but also participants can be moved between different groups based on their idiosyncratic conflicts (i.e., Sarah is normally in Group 1, but next Wednesday she will get to work later than normal and so she will be in Group 3 that day).
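One way to picture the semi-customized approach is as a default group per participant plus a table of day-specific overrides. The sketch below is hypothetical throughout: the send time for Group 3, the participant names other than Sarah, and the dates are assumptions added for illustration.

```python
# Send time for each group. Groups 1 and 2 follow the text; Group 3 is assumed.
GROUP_SEND_TIMES = {1: "06:45", 2: "07:15", 3: "07:45"}

# Each participant's usual group, based on their typical schedule.
DEFAULT_GROUP = {"Sarah": 1, "Raj": 2}

# Idiosyncratic conflicts: (participant, date) mapped to a temporary group.
OVERRIDES = {("Sarah", "2024-03-06"): 3}  # Sarah arrives late that Wednesday


def send_time(participant, date):
    """Return the time at which this participant's survey link is sent on a date."""
    group = OVERRIDES.get((participant, date), DEFAULT_GROUP[participant])
    return GROUP_SEND_TIMES[group]


print(send_time("Sarah", "2024-03-05"))  # usual group
print(send_time("Sarah", "2024-03-06"))  # moved to Group 3 for one day
```

Keeping overrides separate from the default assignment means a participant automatically returns to their usual group the next day.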

Regardless of how surveys are scheduled, they should be set to deactivate within a specified amount of time to prevent surveys from being completed out of order (Daniels et al., 2009). Consider a study in which links to the daily surveys are sent at 8 a.m., 12 p.m., and 4 p.m. If the window is the same for each person, the survey has to stay active during the entire window but can at least be closed before the next survey is to be sent. But if surveys are customized to participants, the window can be tightened. For example, the participant can be given a specified amount of time (e.g., two hours) to complete each survey. As such, this participant has until 10 a.m. to finish the first survey, while a different participant who does not arrive at work until 9:30 a.m. would have until 11:30 a.m.
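The deactivation logic for customized windows can be sketched in a few lines of Python. The two-hour window follows the example in the text; the function names and the calendar date are assumptions, and an actual survey platform would enforce this through its own expiration settings.

```python
from datetime import datetime, timedelta


def closes_at(sent_at, window_hours=2):
    """Time at which a survey link deactivates."""
    return sent_at + timedelta(hours=window_hours)


def completed_in_window(sent_at, completed_at, window_hours=2):
    """True if the survey was completed while the link was still active."""
    return sent_at <= completed_at <= closes_at(sent_at, window_hours)


first_sent = datetime(2024, 3, 4, 8, 0)    # participant who starts at 8 a.m.
second_sent = datetime(2024, 3, 4, 9, 30)  # participant who arrives at 9:30 a.m.

print(closes_at(first_sent).strftime("%H:%M"))
print(closes_at(second_sent).strftime("%H:%M"))
```

Checking `completed_in_window` during data cleaning offers a second safeguard in case a link was left active longer than intended.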

Measure Selection

Experience sampling research has tended to use shortened measures for the daily surveys. This practice originated from the need to keep surveys as short as possible given the repeated measures nature of experience sampling and the fact that items which may make sense for a scale that assesses someone’s behaviors in general (i.e., “goes out of the way to make newer employees feel welcome in the work group”; Lee & Allen, 2002) are not applicable to repeatedly assessing a person’s behavior over the past four hours each day for three weeks. Importantly, the majority of experience sampling research does report estimates of measure internal consistency (Gabriel et al., 2019), and those measures are generally held to the same standard as other research, which suggests that even the typically shortened measures often used with experience sampling are generally reliable. One point to note though is that internal consistency cannot simply be calculated directly from the raw measures due to the nonindependence of the reports. One approach that has been common in the literature is to calculate a daily internal consistency value (as each person should only have a single value per day) and take the average. A more recent alternative that Gabriel et al. (2019) have called attention to, however, is to calculate this value strictly from the within-individual variance of the responses (Geldhof et al., 2014).
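The first approach described above (computing internal consistency separately for each day and averaging) can be sketched with the standard formula for Cronbach's alpha. This is an illustration using made-up responses and hypothetical function names; it does not implement the within-individual variance approach of Geldhof et al. (2014), which requires a multilevel model.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha for one day; rows holds one list of item scores per person."""
    k = len(rows[0])  # number of items in the measure
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def average_daily_alpha(responses_by_day):
    """Average the day-specific alphas across all study days."""
    return mean([cronbach_alpha(rows) for rows in responses_by_day.values()])


# Fabricated example data: a three-item measure answered on two days.
responses_by_day = {
    1: [[1, 1, 1], [3, 3, 3], [5, 5, 5]],
    2: [[2, 3, 2], [4, 4, 5], [1, 2, 1], [5, 4, 4]],
}
print(round(average_daily_alpha(responses_by_day), 2))
```

Because each person contributes only one row per day, the responses within a day are independent of one another, which is what makes the ordinary formula usable at the day level.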

Confirmatory factor analyses are also generally presented and held to similar standards as with any other research. Yet these analyses, even if supportive, do not preclude the potential for the selected items to be deficient in terms of capturing the content domain for the construct. One option to alleviate this concern is to follow an approach used by Rosen et al. (2019) in which a supplementary study is conducted that administers the whole measure to a sample and examines the correlation between the short- and long-form. Also, a new procedure from Cortina et al. (2020) provided a more data-driven process for shortening measures. Again, however, the research question should ultimately be the primary driver of these decisions.

Scale Anchors

While it may seem desirable to leave anchors as they were originally specified for the scale, anchors do need to be tailored to the context in which the measure is to be used and in line with the temporal frame of the study. One option is to change anchors to agree/disagree. An alternative sometimes used for behavioral measures specifically is to use counts (e.g., 1 = “0 times,” 2 = “1 time”), or to use anchors such as 1 = “not at all” to 5 = “very much.” What is critical is that authors be thoughtful about how the anchors fit with the context and research question at hand.

Important Issues Post Data Collection

Data Structure

Recall that the number of surveys per day and the frame of reference for the measures in those surveys determine how the data is subsequently structured and described. Imagine 100 participants complete three surveys per day for 15 days: 4,500 surveys. However, 4,500 is not necessarily the final sample size; this depends on the research question and analytic needs. First, there are actually two sample sizes that need to be discussed. The first is the number of participants, which in this example is 100. This is referred to as the “Level 2 N.” The second refers to the number of cases for analysis, which will be referred to as the “Level 1 N.” If each survey were to provide the full information needed for the analysis, then the Level 1 N would be 4,500. But this is not common. Typically, each survey provides one or more variables for the overall model (e.g., some experience at T1 influences some state at T2, which influences some behavior at T3). In this situation, a complete set of surveys is needed to create a day-level case for the analysis, in which case the Level 1 N is a function not of the number of surveys, but of the number of study days (100 participants × 15 days = 1,500).

Even then, the final usable sample size is not guaranteed to be equal to the total complete days available. For example, if lagged variables are needed (for example, the independent variable was collected the previous night), the sample size might be considerably reduced. Consider a case where a participant fills out all surveys on days 1, 3, and 4. Although three days of data are present, only one case (that was collected on day 4) has the necessary lagged variable from the previous day. A full set of surveys completed on days 1, 2, and 3 would provide two valid observations, while a full set of surveys completed on days 1, 3 and 5 would provide none.
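
The sample size arithmetic and the lagged-variable examples above can be checked with a short sketch. The function name `count_lagged_cases` is hypothetical, and the sketch assumes that a usable case requires complete surveys on both a given day and the immediately preceding study day.

```python
def count_lagged_cases(complete_days):
    """Count days that are complete AND whose previous day is also complete."""
    days = set(complete_days)
    return sum(1 for d in days if d - 1 in days)


# Design from the running example: 100 people, 3 surveys per day, 15 days.
participants, surveys_per_day, study_days = 100, 3, 15
level2_n = participants                                       # 100
total_surveys = participants * surveys_per_day * study_days   # 4,500
max_level1_n = participants * study_days                      # 1,500 day-level cases

# The three response patterns discussed in the text.
pattern_a = count_lagged_cases([1, 3, 4])  # only day 4 has a usable lag
pattern_b = count_lagged_cases([1, 2, 3])  # days 2 and 3 have usable lags
pattern_c = count_lagged_cases([1, 3, 5])  # no consecutive days
```

The same number of completed days can therefore yield anywhere from zero to two usable cases in this example, depending entirely on whether the completed days are consecutive.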

Data Screening

Data screening is not unique to experience sampling; however, given the amounts of money that participants can earn for these studies, the incentives to cheat may be higher. If the recruitment method is relatively open and participants are signing up with web-based email addresses (e.g., Gmail, Yahoo, etc.), then it is critical that researchers have some means for verifying that participants are both unique and who they say they are (for example, employed adults). One option is to not use electronic payment methods (e.g., gift card codes) for these samples. Instead, requiring a home address to send a check or gift card code can make it more onerous for people to register multiple times. Alternative options involve either calling participants to verify their identity (e.g., Greenbaum et al., in press), or requiring participants to sign a tax document (e.g., W-9) to validate their identity, something that is not prohibitive with the increasing availability of electronic document signing programs (and may even be required by some universities). These concerns are lessened for samples that rely on participants from a known entity (e.g., company, university, etc.) and where participants use their company email addresses. Identity verification of this kind is not feasible for online subject pools, however, as anonymity is fundamental to their design. That said, these sites have processes in place to ensure that individuals register only once, though this does not obviate the need to carefully screen those sign-ups. Other methods of data screening such as those suggested by Meade and Craig (2012) can be implemented at this stage as well.

Missing Data

Missing data is commonplace with experience sampling studies. Some “missingness” exists simply due to participants not completing any surveys on a given day. In general, this is not a problem, though some have advocated to only retain people who completed some minimum number of days of the study. Gabriel et al. (2018, p. 92) spoke directly to this point, explaining that “three data points per person are statistically needed to appropriately model within-person relationships.” The point is that when analyzing nested data (which is how these data are structured; Bryk & Raudenbush, 2002; Dimotakis et al., 2013), each person has a regression line representing their intra-individual relationship between the variables. A line cannot be drawn with one case, and for two, that line would fit the data perfectly and thus have no statistical error. Therefore, three has often been used as a cut-off, though there is variance in both whether studies implement a cut-off, as well as what that cut-off is (i.e., there is a trade-off between retaining more data overall versus increasing the stability of the intra-individual relationship by only keeping data from participants who provided a greater number of cases).
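
A minimum-days cut-off of this kind is straightforward to apply. The following is a minimal sketch, assuming the data are held as (participant, day) pairs for complete days; the function name `apply_cutoff`, the participant labels, and the default of three days are illustrative choices rather than a prescribed rule.

```python
from collections import Counter


def apply_cutoff(day_records, min_days=3):
    """Keep only records from participants with at least `min_days` complete days.

    day_records: list of (participant_id, day) tuples, one per complete day.
    """
    counts = Counter(pid for pid, _ in day_records)
    return [(pid, day) for pid, day in day_records if counts[pid] >= min_days]


# Hypothetical records: "a" has three days, "b" has two, "c" has one.
records = [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2), ("c", 4)]
retained = apply_cutoff(records)               # only participant "a" remains
retained_lenient = apply_cutoff(records, 2)    # "a" and "b" remain
```

Raising `min_days` makes each retained person's intra-individual estimate more stable but shrinks both the Level 1 and Level 2 sample sizes, which is the trade-off noted above.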

Missing data can also occur when participants complete some but not all of the surveys on a given day. For example, participants may get to work late and thus miss a morning survey or might leave work particularly early or late and miss an evening survey. This could also come from a coworker who does not complete a survey (or did not interact with the participant that day). How this data is treated depends on where it is positioned in the research model. It is most common (though not universal) to specify main hypothesized paths with random slopes as can be seen in supplemental syntax provided by Gabriel et al. (2019). If this analysis is conducted as a path analysis (e.g., with Mplus), it will not run with missing data on a predictor (i.e., independent variable or mediator). For missing data on a dependent variable (or, for variables modeled with fixed slopes, which is typical for control variables), the analysis will by default be conducted with a “full-information” algorithm (Newman, 2014). This is preferable to listwise deletion (e.g., Enders, 2010; Graham, 2009; Larsen, 2011), and so has become a common way to handle this situation (e.g., Hill et al., 2020; Lennard et al., 2019, in press; McClean et al., 2021; Rosen et al., 2021; Tang et al., in press; Trougakos et al., 2015).

Some Remaining Points

Space and the specific goals for this chapter preclude delving into a discussion of data analysis and the myriad associated questions. Experience sampling study designs afford considerable flexibility with regard to the hypotheses that can be tested and the ways in which the data can be analyzed. For example, the data can be analyzed at the within-individual level, aggregated to the between-level (typically as a central tendency, but can also be used as an estimate of dispersion), leveraged to examine cross-level main or moderated effects, used to investigate period-to-period change, or even utilized to create mini-time series that describe an underlying phenomenon at given points in time (Fu et al., 2021; Matta, Sabey, et al., 2020; Rosen et al., 2020). Note that unless the data are fully aggregated to the person-level (i.e., if the data set maintains its two-level structure), then the data are nested and thus some accounting for this nonindependence is necessary with the analytical approach. A number of sources have spoken on these issues (e.g., Beal & Gabriel, 2019; Bryk & Raudenbush, 2002; Gabriel et al., 2019), and there are many published exemplar papers that can be consulted. Table 1 provides a non-exhaustive list of suggested items to include in the method section of an experience sampling study. What is critical is that the analysis must fit the research question. Experience sampling is a tool, but theory must guide its application. That said, there are two final issues to consider.

Table 1. Suggested Information for Experience Sampling Method Sections



Sample source: How/where sample was identified, recruited, etc.

Sample information: Demographics, etc.

Amount and scheme (e.g., $1 per survey, a bonus for completing all surveys, etc.)

Sample size: Final Level-1 and Level-2 and how that number was reached

Timing of when surveys were sent and completed

Item source, numbers, explanations for adaptations, full wording, etc.

Evidence for measure appropriateness (e.g., reliability, CFA, etc.)

Percentage that is within and between

Analytic Strategy: Model specification, centering/variance partitioning decisions, approach to hypothesis testing, etc.

Experience Sampling or Longitudinal/Latent Growth?

We discussed in the earlier section on “Repeated Measures” that a core assumption behind experience sampling is that the data collection occurs during a “typical” window in people’s lives. This property is critical because otherwise group-mean centering (the typical centering approach used for experience sampling data; Dimotakis et al., 2013) would technically violate causality, as responses from later points in the study influence the deviation of prior responses from the person’s “typical” level. Thus, a critical assumption is that the individual and the within-individual phenomenon under investigation are not inherently changing over the course of the study (i.e., there are fluctuations, which is what makes the phenomenon within-individual, but there is no growth or change). For this reason, longitudinal or latent change approaches are inappropriate for experience sampling data, and reviewers should neither request these methods be employed, nor suggest that the analysis is otherwise flawed for not using these approaches.
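
The point that later responses influence the deviations of earlier ones can be made concrete with a small sketch of group-mean (that is, person-mean) centering. The function name `person_mean_center` and the scores are hypothetical and chosen only for illustration.

```python
from statistics import mean


def person_mean_center(scores):
    """Subtract one person's own mean from each of that person's daily scores."""
    person_mean = mean(scores)
    return [s - person_mean for s in scores]


# One hypothetical participant's daily scores.
first_three_days = [3, 4, 5]
with_fourth_day = first_three_days + [8]  # an unusually high day 4

centered_before = person_mean_center(first_three_days)  # day 1 deviation: -1
centered_after = person_mean_center(with_fourth_day)    # day 1 deviation: -2
```

The day 1 score is identical in both cases, yet its centered value shifts once the day 4 response enters the person's mean. This is harmless when scores fluctuate around a stable level, but it becomes a problem when the level itself is trending over the study.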

In contrast, if the phenomenon is growing or changing, the typical analytic approach for experience sampling data would be inappropriate, as group-mean centering would remove variance that has not yet occurred, which is obviously problematic. In this situation, the data should be examined with an approach that is explicitly designed to detect and predict change or growth patterns. To put it simply, if the research question involves fluctuations around a stable mean, then group-mean centering, variance partitioning, and other such typical approaches to analyzing experience sampling data should be used. If instead growth is the effect predicted, longitudinal or other change methods should be used. These approaches are not substitutable.

Effect Size and Variance Explained

It is good practice to calculate variance explained in the endogenous variables. However, the effect sizes and corresponding percentages of variance explained in experience sampling are often somewhat small. For this reason, the practical impact of the research could be questioned. Yet this omits the critical realization that this question involves a relationship that occurs between two constructs over a very short period of time (e.g., over the past several hours, right now, etc.). As already discussed, experience sampling studies are conducted during normal and common periods of people’s work lives. It is unlikely that such studies would have large relationships—imagine if every time the person in the office next to you was in a better mood than usual, they immediately came to your office and tried to help with the analyses that you were running. Alternatively, imagine that each time that person felt more angry than usual, they walked office-to-office being rude and uncivil.

These examples are extreme, but illustrate the point. There are myriad contextual and idiosyncratic influences on what employees experience, feel, or do in the tight windows examined in experience sampling that are averaged out when examining the same phenomenon over longer periods of time (e.g., over the past week, over the past month, or in general). Over relatively short periods of time during generalizable periods of an employee’s life, the norm should be small effect sizes, and even here, note that the word “small” may be inappropriate because it is inherently comparing the size of experience sampling effects against those derived from studies with completely different designs (Bosco et al., 2015; Cohen, 1992).

To the extent that these effects are considered small, it should be noted that, for example, the correlation between antihistamine usage and symptom relief is .11 and that between anti-inflammatory drugs and pain relief is .14 (Funder & Ozer, 2019). This leads us to the more important question to ask—not whether these effects are small, but whether they are meaningful. Funder and Ozer (2019, p. 161) provide the statistics noted above in service of making the argument that if a relationship that occurs over a short time period can be reliably shown, “its influence could in many cases be expected to accumulate into important implications over time or across people even if its effect size is seemingly small in any particular instance.” If theoretically-derived relationships can be found in just a few hours, the fact that employees spend upwards of 2,000 hours at work each year suggests the relationship may potentially compound and be important in the long-run (Abelson, 1985; Cascio & Boudreau, 2008; Prentice & Miller, 1992).

In sum, it is good practice for authors to report some index of effect size (e.g., variance-explained) for their hypothesized relationships. Regardless, editors and reviewers are encouraged to refrain from criticizing the size of those effects for not meeting their own implicit standard of how large an effect “should be” in order to be practically meaningful. Often, this criticism is wielded as a bludgeon against which authors have no meaningful counter, given that effect sizes viewed as small are the norm with this research. At a minimum, effect sizes for experience sampling studies should not be compared with effect sizes derived from experiments or studies with a longer frame of reference for the relationships under investigation. These effects are not comparable, and so holding experience sampling effect sizes to this standard is unreasonable. Instead, a study that would be helpful is one similar to that of Bosco et al. (2015), which categorizes the types of relationships that have been examined and empirically indexes “typical” effect sizes.

Further Reading

  • Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Lawrence Erlbaum.
  • Dalal, R. S., & Hulin, C. L. (2008). Motivation for what? A multivariate dynamic perspective of the criterion. In R. Kanfer, G. Chen, & R. D. Pritchard (Eds.), Work motivation: Past, present, and future (pp. 63–100). Taylor & Francis Group.