

Printed from Oxford Research Encyclopedias, Business and Management. Under the terms of the licence agreement, an individual user may print out a single article for personal use (for details see Privacy Policy and Legal Notice).

date: 12 May 2021

Experimental Designs in Business Research

  • Heiko Breitsohl, Department for Personnel, Management, and Organization, University of Klagenfurt


Conducting credible and trustworthy research to inform managerial decisions is arguably the primary goal of business and management research. Research designs, particularly the various types of experimental designs available, are important building blocks for advancing toward this goal. Key criteria for evaluating research studies are internal validity (the ability to demonstrate causality), statistical conclusion validity (drawing correct conclusions from data), construct validity (the extent to which a study captures the phenomenon of interest), and external validity (the generalizability of results to other contexts). Perhaps most important, internal validity depends on the research design’s ability to establish that the hypothesized cause and outcome are correlated, that variation in them occurs in the correct temporal order, and that alternative explanations of that relationship can be ruled out.

Research designs vary greatly, especially in their internal validity. Generally, experiments offer the strongest causal inference, because the causal variables of interest are manipulated by the researchers, and because random assignment makes subjects comparable, such that the sources of variation in the variables of interest can be well identified. Natural experiments can exhibit similar internal validity to the extent that researchers are able to exploit exogenous events creating (quasi-)randomized interventions. When randomization is not available, quasi-experiments aim at approximating experiments by making subjects as comparable as possible based on the best available information. Finally, non-experiments, which are often the only option in business and management research, can still offer useful insights, particularly when changes in the variables of interest can be modeled by adopting longitudinal designs.


Experimental designs and, more generally, research designs are core elements of research in business, management, and many related fields (see the first chapter in both Morgan & Winship, 2015, and Shadish et al., 2002, for brief historical overviews). Adopting an appropriate research design is essential for being able to draw valid conclusions about the phenomenon of interest. Importantly, this is relevant not only from a research perspective (Shadish et al., 2002), but also for business and management practice. Making effective managerial decisions requires understanding the potential consequences caused by those decisions, based on the best available evidence (Antonakis, 2017; Barends & Rousseau, 2018; Podsakoff & Podsakoff, 2019). The trustworthiness of such evidence depends heavily on the research design applied in the studies producing the evidence. This article offers an introduction to experimental designs and other research designs, starting with a brief overview of basic types of research designs and settings. The following section describes conditions necessary for drawing valid conclusions from research studies, with an emphasis on conclusions about causality. These provide the conceptual background for the subsequent sections detailing key elements and types of common designs, namely experiments, natural experiments, quasi-experiments, and non-experiments. The final section briefly addresses some available options for data analysis and highlights some practical considerations.

Fundamentals of Research Design

Basic Types of Research Designs

Research designs can be broadly distinguished according to the extent to which the researchers control the sources of variation in the causal variables, that is, the decisions, interventions, or “success factors” (Barends & Rousseau, 2018) suspected to have an effect on important, measured outcomes. These distinctions, in turn, are associated with specific strengths and weaknesses (see Table 1). Control over sources of variation, that is, the processes creating variation (Ketokivi & McIntosh, 2017), depends on two main attributes of research designs: the extent to which the value on the causal variable (independent variable) of each subject (e.g., employee, team, firm) is determined randomly, and if (and how) that value is manipulated by someone other than the subject (Podsakoff & Podsakoff, 2019). These attributes have important implications for the abilities of the research designs to identify causal relationships, and thus provide trustworthy information for managerial decisions.

Table 1. Basic Types of Research Designs

Experiment
  Causes (independent variables): value determined randomly; manipulated by researchers
  Outcomes (dependent variables): measured
  Selected strengths: strong causal inference due to randomization and control
  Selected weaknesses: manipulations in the laboratory may be perceived as artificial; challenging to gain access in field settings

Natural experiment
  Causes (independent variables): value determined randomly or as-if-randomly; manipulated by exogenous events or decisions
  Outcomes (dependent variables): measured
  Selected strengths: moderate to strong causal inference; potentially strong construct validity
  Selected weaknesses: opportunities may be very scarce for many research questions; external validity may be limited due to unique setting

Quasi-experiment
  Causes (independent variables): value determined nonrandomly; self-selected by subjects or exogenously selected
  Outcomes (dependent variables): measured
  Selected strengths: moderate to strong causal inference; easier to conduct in organizations than experiments
  Selected weaknesses: challenging to conduct in field settings due to difficult access; requires careful consideration of alternative explanations

Non-experiment
  Causes (independent variables): value determined nonrandomly; measured “as is”
  Outcomes (dependent variables): measured
  Selected strengths: relatively easy to conduct
  Selected weaknesses: weak causal inference

When the values of the causal variables are merely measured “as is” (like those of the outcome variables), which is very common in business and management research (Podsakoff & Podsakoff, 2019), this is called a non-experimental design. In such a design, the researcher has very little control over the sources of variation in the variables of interest. By comparison, when there is information available indicating that subjects were selected exogenously (i.e., not by the researcher), or self-selected, into the values of the causal variables, albeit nonrandomly, this information can be used to construct a quasi-experiment, which is generally an improvement over the non-experiment. In some cases, the exogenous manipulation of the causal variable happens in a random or as-if-random fashion, which constitutes a natural experiment. While this is relatively rare, it can be very helpful in controlling sources of variation. Finally, if the researchers are able to directly control variation in the causal variables by manipulating them in a randomized process, they are in the attractive position of conducting an experiment, which offers the highest degree of control. These basic designs are discussed further in later sections of this article.

Research Design and Causal Inference

The core purpose of research design is to enable researchers to make valid inferences from the particulars of their study to the more general explanation they aim to test by conducting the study. Ideally, a research design will be strong with respect to four types of validity (Shadish et al., 2002; Stone-Romero, 2011), each of which addresses a different kind of conclusion: internal validity, statistical conclusion validity, construct validity, and external validity. Perhaps the most important of the four, internal validity, is the extent to which the study allows inferring causality, for example, that variable X is the cause of variable Y. In the classic framework by Shadish et al. (2002), there are three requirements for valid causal inference of an effect of X on Y:


Correlation: Variation in X is associated with variation in Y.


Temporal precedence: Variation in X occurs before variation in Y.


Absence of alternative explanations: The association between variation in X and variation in Y cannot be explained by anything other than a causal relationship.

The internal validity of a research design thus depends on the extent to which these three requirements can be satisfied. Generally, the first requirement is relatively liberal, as establishing a correlation between two variables is straightforward, given adequate construct validity (i.e., a well-functioning measure for each variable) and statistical conclusion validity (i.e., appropriate data-analytical techniques matching the characteristics of the data; adequate sample size). Indeed, all research designs discussed in the following sections are generally able to satisfy the requirement of correlation.

The second requirement tends to be more challenging, as it entails ensuring that variation in X occurred before variation in Y. This goes beyond mere correlation, because X and Y may also be correlated due to variation in Y occurring before variation in X. In research practice, common approaches to establishing temporal precedence involve measuring Y after variation in X has been purposely induced by the researcher (i.e., in an experiment), exogenously induced by some other source (i.e., in a natural experiment or quasi-experiment), or measured (i.e., in a non-experiment). In this order, temporal precedence is less and less trustworthy, as the true origin of variation in X is more and more ambiguous. Thus, one of the strengths of experiments over other types of designs is the ability to establish unambiguous temporal precedence.

Yet, the most challenging requirement is the third, because a host of alternative explanations for the relationship between X and Y may be available. In general terms, even if the temporal ordering of X and Y is correctly identified, the two variables may be correlated due to other (unmeasured) variables that are, in turn, correlated with both X and Y. This problem has been discussed using a variety of labels, such as confounding variables (Crano et al., 2015) or endogeneity (Antonakis et al., 2010; Hill et al., 2021). The problem can manifest through a number of mechanisms that provide alternative explanations for a causal effect of X on Y. Selected examples are presented in Table 2; for a more comprehensive discussion see Shadish et al. (2002), Podsakoff and Podsakoff (2019), or Crano et al. (2015).

A strength of experimental designs is their ability to rule out alternative explanations, through two mechanisms: randomization and control. Randomization implies that any alternative explanations pertaining to the subjects (e.g., selection) can be ruled out because, on average (given a sufficient sample size), the only difference between subjects is their difference on the causal variable (X). The unique importance of randomization lies in the fact that it partially relieves researchers from having to understand every relevant alternative explanation, although incorporating specific alternative explanations is still possible (i.e., blocking designs; Crano et al., 2015). Many other techniques focus on ruling out specific explanations based on theory. Control is one such technique, where the goal is to keep all elements of the study setting constant in order to prevent any alternative explanations (e.g., history) from interfering with the outcome. The extent to which control is possible, in turn, depends partly on the setting (see section “Settings for Business and Management Research”), although control is also applied through keeping constant all elements of the experimental manipulation that do not reflect differences in X.
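The leveling effect of randomization can be illustrated with a small simulation. The subject pool and the “preexisting skill” covariate below are purely hypothetical; the point is only that random assignment equalizes groups, on average, even on variables the researcher never measured:

```python
import random
import statistics

random.seed(42)

# Hypothetical subject pool: each subject carries a preexisting skill
# level (a potential confound) drawn from the same distribution.
subjects = [random.gauss(50, 10) for _ in range(10_000)]

# Random assignment to a control and a treatment group.
random.shuffle(subjects)
control, treatment = subjects[:5_000], subjects[5_000:]

# On average, the two groups are nearly identical on the confound, so a
# post-treatment difference can be attributed to the manipulation.
gap = abs(statistics.mean(control) - statistics.mean(treatment))
print(f"Mean preexisting-skill gap between groups: {gap:.2f}")
```

With samples of this size the gap is a small fraction of the standard deviation; in small samples, randomization equalizes groups only in expectation, which is why sufficient sample size matters.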

Table 2. Selected Alternative Explanations for Causal Effects, With Examples Based on a Fictitious Employee Training Program

Selection: Subjects are assigned to values of the causal variable in some unknown process that deviates from randomness, and are thus not comparable.
  Example: Employees voluntarily sign up for a training program, partly based on their individual interests and preexisting skills. Thus, an apparent difference in skills between participants and nonparticipants after the program may be due to these preexisting differences rather than program effectiveness.

History: External changes occurring during the course of the study affect the outcome of interest.
  Example: Changes in working conditions facilitate improving preexisting skills on the job through repeated practice. Thus, an apparent difference in skills between participants and nonparticipants after the program may be due to opportunities for on-the-job practice rather than program effectiveness.

Maturation: Subjects, themselves, undergo change unrelated to the study conditions that affects the outcome of interest.
  Example: Training program participants improve their skills through sheer experience outside the program. Thus, an apparent difference in skills between participants and nonparticipants after the program may be due to increased experience rather than program effectiveness.

Attrition: Subjects nonrandomly drop out of the study, i.e., drop-outs are different from those remaining in the study.
  Example: Training program participants for whom program content is less appealing drop out of the program disproportionately. Thus, an apparent difference in skills between participants and nonparticipants after the program may be due to these individual differences rather than program effectiveness.

Interaction effects: One alternative explanation adds to another, or depends on another.
  Example: During the course of the training program, some participants acquire the intended skills in some other way (maturation), and then drop out of the program (attrition). Thus, any apparent difference in skills between participants and nonparticipants after the program may be due to these combined effects rather than program effectiveness.

Conversely, the key weakness of non-experimental designs is their limited ability to rule out alternative explanations, which is why non-experiments are not very informative about causal relationships, although there are great differences among non-experimental designs. As researchers are unable to use randomization or control, ruling out any of the various alternative explanations requires theory and, if possible, valid measures of confounding variables as well as sophisticated statistical techniques (Ketokivi & McIntosh, 2017; Pearl et al., 2016). In other words, while experiments prevent most alternative explanations by design, non-experiments must resort to repairing threats to internal validity after the fact. Situated between experiments and non-experiments, the primary goal of quasi-experimental designs—as the name implies—is to approximate the validity of experiments in situations where randomization is not available, but there is a manipulation of the causal variable occurring. This is done by using several techniques (see the section on “Quasi-Experiments”), sometimes in combination, aimed at making subjects as closely comparable as possible to a control group and/or to themselves over time.

In addition to internal validity, which is the main focus of this article, three other kinds of validities are important: construct validity, statistical conclusion validity, and external validity (see Shadish et al., 2002; Stone-Romero, 2011). Briefly, statistical conclusion validity is the extent to which the conclusions based on any statistics estimated from the collected data are correct. Common issues are violated assumptions of statistical models, measurement error, and low power to detect an effect of interest. See also the section on “Practical Considerations and Data Analysis.” Construct validity is the extent to which the particulars of the study (measures, manipulations, settings, subjects, etc.) represent the phenomenon of interest. Strong construct validity requires that any materials used in the study be well-calibrated, usually through pilot studies or the use of previously validated measures and/or manipulations, and that study settings and subjects be carefully chosen or created. The goal is that the study particulars represent the phenomenon of interest as fully as possible, but no other phenomena, and that this representation does not deviate systematically (i.e., bias) or unsystematically (i.e., unreliability, measurement error) from the focal phenomenon. External validity is the extent to which a study’s findings (mainly causal inferences) are generalizable to other populations, settings, manipulations, or measures. In contrast to the other three kinds of validities, external validity inherently cannot be fully assessed by focusing on the attributes of the study at hand. While the specificity of the particulars of a study may offer some indication of external validity, a thorough assessment requires comparisons across studies. A direct approach would be for the same researchers to conduct multiple studies using different, specific manipulations, measures, settings, and subjects, while a more indirect, cumulative approach is to collect the data from a wide variety of existing studies on the same general topic and aggregate them statistically using meta-analysis (Stone-Romero, 2011).

Finally, these four types of validity, collectively, are often referred to as study rigor (Antonakis, 2017). Unfortunately, a common misconception is that rigor is somehow incompatible with relevance to managerial practice, such that the two need to be balanced. A more accurate description is that a study has to be rigorous in order to be relevant (Antonakis, 2017), because if its results are not trustworthy, they cannot be relevant to practice. This is why understanding the designs explained in the following sections is important for both researchers and practitioners.

Settings for Business and Management Research

In addition to designs, research studies can be distinguished by settings. While a common distinction is between laboratory and field settings, a more informative perspective differentiates between special purpose and non–special purpose settings (Stone-Romero, 2011). Special purpose settings are created for the purpose of conducting research, are thus usually temporary in their existence, and allow for a large degree of researcher control over the situation, including manipulation of the causes of interest. This makes them particularly suitable for ruling out alternative explanations. By comparison, non–special purpose settings (e.g., an actual business organization) are taken “as is”. While the commonly resulting lack of control weakens internal validity, these settings tend to possess strong construct validity as they are real instantiations of the phenomenon of interest. Yet, the claim that non–special purpose settings generally have strong external validity (and that special purpose settings do not) is based on the problematic assumption that the specific setting is representative. This is rarely true, however, as most samples in business and management research are non-representative, and external validity is more appropriately assessed through meta-analysis of a larger set of studies (Stone-Romero, 2011).

Broadly speaking, there is some correlation between research design and setting. Special purpose settings tend to be created for experimental designs, precisely because they afford the desired control for establishing causality. In contrast, natural experiments, quasi-experiments, and non-experiments tend to be conducted in non–special purpose settings, for example, because creating a special purpose setting is infeasible or unethical. However, this correlation is not perfect, as sometimes, randomized manipulation is possible in non–special purpose settings. Field experiments are a prominent example combining the strong internal validity of experimental designs with the strong construct validity of field settings, while often being challenging to conduct (e.g., Chatterji et al., 2016; Eden, 2017; Podsakoff & Podsakoff, 2019).

Experiments

Experiments are characterized by the causal variables of interest being manipulated by the researcher, that is, subjects being exposed to specific values (e.g., low vs. high) of those variables, such that the assignment to the values occurs randomly. The experimental toolbox offers a wide variety of designs that can be adapted to suit the research question and setting at hand (Crano et al., 2015; Kirk, 2013; Maxwell et al., 2018; Shadish et al., 2002; Tabachnick & Fidell, 2007). Broadly, these designs are distinguished by whether random manipulation is conducted between or within subjects, how many causal variables (“factors”) are manipulated, how many different values (“levels”) are manipulated for each factor, and the extent to which all possible combinations of levels of factors are included (complete vs. incomplete designs; Tabachnick & Fidell, 2007).

In between-subjects designs (or randomized-groups designs; Tabachnick & Fidell, 2007), subjects are randomly assigned to separate groups representing different levels of the experimental factors (independent variables). Thus, not all subjects are exposed to the same level(s) and, from an individual subject’s perspective, that level is purely coincidental (and the alternatives are unknown). The causal effects of interest are differences between those groups, specifically the differences among group means on the outcome (dependent) variable(s). Due to random assignment, those differences can, generally, be assumed to be caused only by the groups’ differences on the independent variables, which allows for strong causal inference. In within-subject designs, all subjects are exposed to more than one level of the experimental factor(s). The dependent variables are measured after each exposure, and the causal effects of interest are differences among level-specific means.

The decision between the basic options of between- or within-subjects designs, which must be taken for each experimental factor, is based on a trade-off between their strengths and weaknesses. Briefly, between-subjects designs offer the advantage of avoiding “spillover” effects between exposures to different levels, as every subject receives only one, but require larger sample sizes for the same reason. Within-subjects designs can be applied to smaller samples, as each subject effectively serves as their own “control group,” albeit at the risk of spillover effects. The standard approach to preventing spillover in these designs is to counterbalance, that is, to randomly order, the exposures of the different factor levels (Crano et al., 2015). Designs in which the dependent variables are measured at least twice over time, before and after exposure to the manipulation (pretest-posttest designs or time-series designs; Shadish et al., 2002), are often considered part of this same type of design, with the terms within-subject and repeated-measures designs being used interchangeably (Crano et al., 2015; Tabachnick & Fidell, 2007).
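Counterbalancing can be sketched in a few lines. The factor and its three levels below are hypothetical; the sketch simply gives each subject an independently randomized order of exposure, so that order effects average out across the sample:

```python
import random

random.seed(7)

levels = ["low", "medium", "high"]  # hypothetical within-subjects factor levels

def counterbalanced_orders(n_subjects, levels):
    """Return an independently randomized exposure order for each subject."""
    orders = []
    for _ in range(n_subjects):
        order = levels[:]        # copy so the master list stays intact
        random.shuffle(order)    # randomize this subject's exposure sequence
        orders.append(order)
    return orders

orders = counterbalanced_orders(6, levels)
for i, order in enumerate(orders, start=1):
    print(f"Subject {i}: {' -> '.join(order)}")
```

Systematic schemes (e.g., Latin squares) that guarantee each level appears equally often in each position are an alternative to the purely random ordering sketched here.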

The simplest experimental design, the one-way design, features a single factor, which is not uncommon in business and management research. However, many experiments feature more than one factor, in a factorial design. When all factors in a factorial design are of the same type (between-subjects vs. within-subject), that design is simply labeled with the type name. In contrast, when at least one factor is of the other type, it is called a mixed factorial design. Note that the term “factorial” is sometimes reserved for complete designs (e.g., Tabachnick & Fidell, 2007), in which all combinations of levels of factors are used. Complete designs enable the researcher to test all possible interaction, main, and simple effects, while requiring the largest sample size and, with within-subject factors, careful consideration of potential spillover effects. The number of levels for each factor can range, in principle, from two to a very large number; in practice, it is usually limited by the research question, the extent to which different levels can be validly manipulated, and the available sample size. Incomplete designs offer a solution to having a large number of “cells” (the possible combinations of all levels of factors) by eliminating specific combinations and thereby forgoing the possibility of testing certain effects, that is, interaction effects of certain orders. For example, a factorial design with three factors could be reduced to an incomplete design in which all effects except the three-way interaction are available, but it could also be further reduced to designs in which some or all of the two-way interactions are unavailable as well. In other words, incomplete designs are attractive if at least some interaction effects are considered negligible (Tabachnick & Fidell, 2007).

The general convention for describing experimental designs is to denote the number of levels for each factor, where the factors are separated by a multiplication sign, and to add further information on the type of design (Tabachnick & Fidell, 2007). However, other conventions are available (Shadish et al., 2002), each emphasizing different design aspects. For example, a randomized design with two treatment groups, one control group, a single pretest, and a single posttest (Shadish et al., 2002) could also be described as a mixed factorial 3 (randomized groups) × 2 (repeated measures) design.
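The 3 (randomized groups) × 2 (repeated measures) notation can be made concrete by enumerating the cells of a complete design. The factor names below are illustrative, not taken from any particular study:

```python
from itertools import product

# Hypothetical complete 3 x 2 mixed factorial design: every combination
# of factor levels defines one cell of the design.
message = ["none", "task-focused", "prosocial"]   # between-subjects factor (3 levels)
timing = ["pretest", "posttest"]                  # repeated-measures factor (2 levels)

cells = list(product(message, timing))
print(f"A {len(message)} x {len(timing)} design has {len(cells)} cells:")
for m, t in cells:
    print(f"  message={m}, timing={t}")
```

An incomplete design would simply drop some of these combinations, at the cost of the interaction effects those cells would have identified.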

As an example of an experimental design, Study 2 by Grant and Hofmann (2011) randomly assigned subjects to either a control group or one of two treatment groups (i.e., a one-way between-subjects design). The subjects, whose task performance and citizenship performance were to be measured, were exposed to different motivational messages (or no message, in the control group). Through randomization, the researchers were able to rule out any alternative explanations based on subject characteristics, such as selection or attrition (see Table 2). Moreover, as the laboratory setting allowed for tightly controlling many exogenous influences on subject performance, other alternative explanations such as history could be ruled out. Without this degree of control, more elaborate designs are necessary (see the discussion of Study 1 from Grant & Hofmann, 2011 in the section on “Quasi-Experimental Designs”).

The challenges associated with experimental designs typically revolve around creating settings and stimuli that reflect the phenomenon of interest and are strong enough to affect the outcome(s), but do not appear artificial to the subjects. Another common challenge is demand effects, that is, subjects changing their behavior based on what they believe to be the goal of the study (Lonati et al., 2018; Podsakoff & Podsakoff, 2019). When conducted in field settings, experiments typically involve additional challenges pertaining to gaining access to organizations and being able to devote the required time and effort for planning and running the study (Podsakoff & Podsakoff, 2019).

Natural Experiments

In natural experiments, the cause of interest is not manipulated by the researcher but by “nature,” that is, exogenous events or decisions, creating circumstances that approximate an experiment. The two key features of a natural experiment are that, first, variation in the cause of interest is explained by a (quasi-) random process and, second, that the process generating the randomized or as-if-randomized groups is beyond the control of both the researcher (unlike in an experiment) and the subjects (unlike in a quasi-experiment or non-experiment; Sieweke & Santoni, 2020). Typical examples include changes in government regulations, natural disasters, or other processes beyond human control, such as genetics. For instance, investigating the effect of female leadership on girls’ career aspirations, Beaman et al. (2012) took advantage of gender quotas imposed by law, creating quasi-randomized groups of girls exposed to either female or male role models.

From a research-practical perspective, natural experiments are not “planned,” but researchers take advantage of the exogenous event creating (or having created) favorable circumstances for investigating the causal effects of interest. As identifying such events may be challenging, natural experiments are a relevant option primarily when “true” experiments are infeasible. While considered a separate type of design here, partly due to their relevance in business and management research, the methods used in natural experiments are the same as those in quasi-experiments (see the section on “Quasi-Experiments”), depending on the particular type of empirical setting and exogenous event (Sieweke & Santoni, 2020).

Quasi-Experiments

Quasi-experiments are typically the strongest available type of design in situations where the cause of interest cannot be randomly manipulated by the researcher, nor is there a “natural” opportunity for (as-if) random assignment (Grant & Wall, 2009; Reichardt, 2019; Shadish et al., 2002; Stone-Romero, 2011). Rather, there is some other process determining the value of the cause of interest exhibited by each subject (e.g., employee, firm). Often, subjects self-select into groups representing different values of the causal variable or are selected based on a nonrandom criterion. For instance, firms may adopt a certain strategy based on changes in their environment, or employees may be signed up for a training intervention based on their previous job performance. The lack of randomization has important consequences, as subjects exhibiting different values of the causal variable cannot be readily compared, because they may also differ on other, unobserved characteristics (i.e., those driving the self-selection). In other words, any observed effects on outcomes of interest may be due to alternative explanations (Shadish et al., 2002).

Quasi-experiments aim at modeling the nonrandom assignment process, or compensating for the lack of randomization, in order to approximate an experiment as closely as possible. As the cause of interest cannot be directly manipulated, it is key to identify and isolate the origin of variation in the causal variable (Ketokivi & McIntosh, 2017). In essence, quasi-experimental designs apply one or more of three approaches to achieve this: explicitly incorporating an unambiguous observable assignment variable (i.e., the regression discontinuity design), identifying or constructing one or more comparison groups that differ only on the causal variable (similar to an experimental control group), and tracking subjects over time in order to isolate the point at which change in the causal variable occurred (i.e., pre- and posttests).

When the assignment process is known, usually because it involves an explicit decision within the organization or at a higher level (e.g., industry, regulatory body, legislation), the criterion on which assignment is based can be used to create a regression discontinuity design (RDD; e.g., Antonakis et al., 2010; Shadish et al., 2002; Sieweke & Santoni, 2020). The RDD is applicable whenever there is a known threshold for assigning values of the causal variable, as in the employee-training example. Similarly, a regulatory intervention at the firm level, such as new financial disclosure rules, may be contingent on some threshold, such as firm size. The threshold can be based on any variable that is measured prior to the assignment (i.e., intervention) provided that it is the sole determinant of assignment. Common examples are actual assignment rules, including “pretest” (i.e., a priori) measures of the outcome variable, but also time (e.g., a date or time representing a deadline).
The RDD then provides relatively strong causal inference by exploiting the fact that subjects just below and above the threshold value, respectively, are very similar to each other except for being assigned to different values, thus resembling random assignment. That is, given a sufficient number of cases close to the threshold value, the difference in the outcome variable (i.e., the discontinuity) between subjects just below and above the threshold value will provide an estimate of the causal effect approaching the validity of an experiment.
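The RDD logic can be sketched with simulated data based on the hypothetical firm-size disclosure example. The sketch naively compares raw outcome means just below and above the threshold; actual RDD analyses typically fit local regressions on each side of the cutoff, which removes the small bias the smooth size trend leaves within the bandwidth:

```python
import random
import statistics

random.seed(1)

# Hypothetical sharp RDD: firms at or above a size threshold fall under
# new disclosure rules (treatment); size is the sole assignment variable.
threshold = 100.0
bandwidth = 5.0    # compare only firms close to the threshold
true_effect = 3.0  # built into the simulation, unknown to the "researcher"

firms = []
for _ in range(20_000):
    size = random.uniform(50, 150)
    treated = size >= threshold
    # Outcome depends smoothly on size, plus the treatment jump and noise.
    outcome = 0.1 * size + (true_effect if treated else 0.0) + random.gauss(0, 1)
    firms.append((size, outcome))

just_below = [y for s, y in firms if threshold - bandwidth <= s < threshold]
just_above = [y for s, y in firms if threshold <= s < threshold + bandwidth]

# The discontinuity at the threshold estimates the causal effect
# (here inflated slightly by the smooth trend across the bandwidth).
rdd_estimate = statistics.mean(just_above) - statistics.mean(just_below)
print(f"Estimated discontinuity at the threshold: {rdd_estimate:.2f}")
```

Narrowing the bandwidth reduces the trend bias but leaves fewer firms in each bin, illustrating the need for a sufficient number of cases close to the threshold.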

For example, Steffens et al. (2017) used the RDD in a natural experiment investigating how leaders’ perceived charisma is affected by their death. The passage of time served as the causal variable, with the time of death representing the threshold across which levels of charisma (measured through media reports) were compared. Based on the plausible assumption that leader charisma would remain largely unaffected by other factors during the time period immediately before and after death, the researchers were able to rule out alternative explanations such as history (i.e., increased charisma due to other events) and attrition (less charismatic leaders dropping out of the sample).

Yet, often the assignment process is more ambiguous, usually because subjects self-select into values of the causal variable. For example, employees may be given the option to participate in a training program, or firms may adopt a certain policy through some unknown decision process. In such situations, a combination of comparison groups and repeated measurements of the outcomes of interest, if available, is useful. With comparison groups, the goal is to construct an approximation of a randomized control group by carefully matching the focal subjects exhibiting one value of the causal variable (e.g., attending the training, adopting the policy) with subjects who exhibit a different value (e.g., not attending the training, adopting a different policy), but are very similar in all other respects. Matching is typically based on a large number of covariates in order to cover a wide variety of potential explanations for self-selection, often through computing propensity scores expressing the probability of self-selection (e.g., Austin, 2011; Li, 2013).
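As an illustration of the matching logic, the following sketch assumes propensity scores have already been estimated (e.g., from a logistic regression of training attendance on covariates) and pairs each treated subject with the control whose score is closest; all names and values are hypothetical:

```python
def nearest_neighbor_match(treated, controls):
    """Match each treated subject to the control with the closest
    propensity score (with replacement). Inputs: dicts id -> score."""
    return {t_id: min(controls, key=lambda c: abs(controls[c] - t_score))
            for t_id, t_score in treated.items()}

# Hypothetical propensity scores: the estimated probability of
# self-selecting into the training program, given the covariates
treated  = {"T1": 0.81, "T2": 0.44, "T3": 0.65}
controls = {"C1": 0.12, "C2": 0.47, "C3": 0.79, "C4": 0.63}

pairs = nearest_neighbor_match(treated, controls)
# pairs: {"T1": "C3", "T2": "C2", "T3": "C4"}
```

The matched controls then serve as the approximation of a randomized control group; outcome comparisons are made between each treated subject and its match.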

Alternatively, or additionally, repeated measures of the outcome can be collected before ("pretest") and/or after ("posttest") the variation in the causal variable (e.g., training program, policy change) occurs. These measures are useful in assessing to what extent change in the outcome occurred only after the change in the causal variable, and within a theoretically plausible time frame. A common design using a comparison group as well as a pretest and posttest is the "difference-in-differences design" (Antonakis et al., 2010; Sieweke & Santoni, 2020), also known as the "pretest-posttest design with a nonequivalent control group" (Shadish et al., 2002; Stone-Romero, 2011). In this design, the treatment group(s) and the comparison group(s) are compared on both the posttest and the pretest. Ideally, groups exhibit the same scores on the pretest, indicating comparability as well as the ability to rule out alternative explanations such as history and selection (see Table 2). For example, this condition was met in the study by Derue et al. (2012) investigating the impact of after-event reviews on leadership development, permitting a valid posttest comparison. In fact, this particular study is a good example of a cohort design (Shadish et al., 2002), in which successive, naturally occurring groups are contrasted. The comparison is valid to the extent that the groups differ as little as possible on any variable other than the dependent variable; the researchers assessed this comparability using a host of variables.
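The core arithmetic of the difference-in-differences comparison can be sketched as follows (hypothetical data; applied work would typically estimate the same quantity in a regression framework with appropriate standard errors):

```python
def diff_in_diff(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """DiD estimate: the treatment group's pre-to-post change minus the
    comparison group's change, which absorbs shared history effects."""
    mean = lambda xs: sum(xs) / len(xs)
    return ((mean(treat_post) - mean(treat_pre))
            - (mean(ctrl_post) - mean(ctrl_pre)))

# Hypothetical outcome scores (e.g., leadership effectiveness ratings)
treat_pre, treat_post = [3.0, 3.2, 2.8], [4.1, 4.3, 3.9]
ctrl_pre,  ctrl_post  = [3.1, 2.9, 3.0], [3.4, 3.2, 3.3]

effect = diff_in_diff(treat_pre, treat_post, ctrl_pre, ctrl_post)
# treatment change 1.1 minus comparison change 0.3 -> estimate 0.8
```

Note how the comparison group's pre-to-post change (0.3) is subtracted out: any influence that affected both groups equally between pretest and posttest is removed from the estimate.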

Another relatively well-established design, which does not require a comparison group (but can accommodate one or more), is the "interrupted time series design" (Stone-Romero, 2011), wherein the group of primary interest is tracked over an extended period of time, with many repeated measures of the outcome. This time series enables researchers to model the change in the trajectory of the dependent variable after versus before an intervention, which can be useful in ruling out alternative explanations such as history. Generally, the greater the number of repeated measures and comparison groups, and the larger and better matched the latter, the stronger the quasi-experimental design. Many variations are possible, depending on feasibility, such as switching replications, waitlist designs, or non-equivalent dependent variable designs (see Shadish et al., 2002; Stone-Romero, 2011).
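The basic logic of the interrupted time series, projecting the pre-intervention trend forward as a counterfactual and comparing observed post-intervention values against it, can be sketched as follows (hypothetical, noise-free data for clarity; applied analyses would also model autocorrelation and possible slope changes):

```python
def its_level_shift(series, intervention_index):
    """Fit a linear trend to pre-intervention observations, project it
    forward, and estimate the post-intervention level shift as the mean
    deviation of observed values from that counterfactual projection."""
    pre_t = list(range(intervention_index))
    pre_y = series[:intervention_index]
    n = len(pre_t)
    mt, my = sum(pre_t) / n, sum(pre_y) / n
    slope = (sum((t - mt) * (y - my) for t, y in zip(pre_t, pre_y))
             / sum((t - mt) ** 2 for t in pre_t))
    intercept = my - slope * mt
    deviations = [series[t] - (intercept + slope * t)
                  for t in range(intervention_index, len(series))]
    return sum(deviations) / len(deviations)

# Ten pre-intervention points on the trend y = 20 + 0.5t, followed by
# ten points on the same trend shifted up by a true effect of 4.0
series = ([20 + 0.5 * t for t in range(10)]
          + [20 + 0.5 * t + 4.0 for t in range(10, 20)])

shift = its_level_shift(series, intervention_index=10)  # estimated shift ~ 4.0
```

Because the pre-intervention trend is modeled explicitly, a gradual maturation process is not mistaken for an intervention effect, which is precisely the strength of this design over a simple pre-post comparison.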

For example, Study 1 by Grant and Hofmann (2011) combined an interrupted time-series design with comparison groups. Because they measured the dependent variable (i.e., job performance) repeatedly for each subject both before and after interventions (i.e., speeches delivered by leaders as well as beneficiaries of subjects’ work) occurred, they were able to rule out a number of alternative explanations for their results. These include selection and maturation, as subjects served as their own controls, and history, as the comparison groups were exposed to the same changes external to the study as the treatment groups. Overall, this made for a relatively strong design even in the absence of randomization (see also the discussion of limitations in Grant & Hofmann, 2011).

As quasi-experiments are usually conducted in field (i.e., non–special purpose) settings, a common challenge lies in gaining empirical access to the setting (Podsakoff & Podsakoff, 2019). In addition, because the benefits of random assignment (as in experiments) and tight control over the setting (as in the laboratory) are unavailable, researchers must carefully consider potential alternative explanations and plan their design accordingly (Grant & Wall, 2009).


In non-experiments, all variables of interest are measured, that is, not manipulated or controlled by the researcher or exogenous events. Non-experiments are sometimes also referred to as "observational" or "correlational" designs. The latter labels are problematic because observation is a data collection mode (Crano et al., 2015), correlation is a statistical model (Cohen et al., 2010), and neither is a research design. Yet, these labels do hint at the core weakness of non-experiments: ambiguity about the source of variation in the causal variables of interest, as the researcher is (largely) confined to taking the phenomenon at hand "as is", with very limited options to isolate the effect of interest and infer causality (Ketokivi & McIntosh, 2017).

Yet, there are important differences among non-experimental designs. Longitudinal designs (e.g., Ployhart & Vandenberg, 2010; Ployhart & Ward, 2011) feature repeated measures of all relevant variables (that are not stable) over at least three points in time. When more rigorous designs are unavailable, longitudinal designs can be very useful in disentangling the temporal order among variables, because they allow for distinguishing between-subjects variation (i.e., stable differences such as personality or company industry) from within-subject variation (i.e., changes over time), the latter of which is often more relevant. Yet, this requires knowledge (i.e., theory) about the patterns of stability and change of each variable. The frequency of measurements then has to match those patterns in order to adequately capture change over time (Aguinis & Bakker, in press; Spector, 2019), which is an important challenge in using longitudinal designs. As an example, in their study of newcomer trust development during organizational socialization, van der Werff and Buckley (2017) collected four measures of their variables of interest over the course of 12 weeks based on theory explaining the socialization phenomenon. In the absence of experimental or quasi-experimental controls, further challenges are associated with ruling out alternative explanations through the use of statistical techniques (Hill et al., 2021; Ketokivi & McIntosh, 2017).
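The distinction between stable between-subjects differences and within-subject change can be illustrated with person-mean centering, a common first step in analyzing longitudinal data (function name and trust ratings are hypothetical):

```python
def split_between_within(panel):
    """Person-mean centering: decompose each repeated measure into a
    stable between-subjects component (the subject's own mean) and a
    within-subject component (the deviation from that mean over time)."""
    between, within = {}, {}
    for subject, scores in panel.items():
        m = sum(scores) / len(scores)
        between[subject] = m
        within[subject] = [s - m for s in scores]
    return between, within

# Hypothetical trust ratings for two newcomers at four time points
panel = {"A": [2.0, 3.0, 4.0, 5.0],   # rising trust over time
         "B": [4.0, 4.0, 4.0, 4.0]}   # stable trust throughout

between, within = split_between_within(panel)
# between: {"A": 3.5, "B": 4.0}; within["A"] climbs from -1.5 to +1.5,
# while within["B"] is flat at 0.0 -> all of B's variance is between-subjects
```

A cross-sectional snapshot of the same two subjects would capture only a single number per person, making the rising pattern for A indistinguishable from a stable individual difference.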

By comparison, cross-sectional designs, in which each variable is measured only once, conflate between-subjects and within-subject variation, and thus make it impossible to distinguish stable differences from change over time. As many research questions in business and management are related to change, this limitation makes cross-sectional designs rather unattractive. Unfortunately, a large proportion of business and management studies still adopt cross-sectional designs (Spector, 2019), in spite of severely reduced rigor and thus relevance (Antonakis, 2017). Importantly, and contrary to a common misconception, separating measurements across time (e.g., measuring X at Time 1, and Y at Time 2) offers very little improvement. This is because the design still prevents identifying when change occurred in each variable. Unless X (i.e., the causal variable) can be assumed to be stable and/or exogenous, which is often unrealistic in business and management research, both temporal precedence and alternative explanations are problematic. Thus, cross-sectional non-experimental designs pose a host of challenges to researchers, leaving only a limited set of research questions these designs may be able to address (Spector, 2019).

Practical Considerations and Data Analysis

In addition to selecting an appropriate design, a number of practical considerations are important in implementing the designs described here, of which this article can provide only an overview, ordered roughly along the research process (Crano et al., 2015). Assuming that research questions and hypotheses (if any) have been formulated, finding a suitable sample is important. Beyond more traditional types of samples recruited from student participant pools or “real” organizations (Crano et al., 2015; Landers & Behrend, 2015), a trend in business and management research since roughly the beginning of the 21st century is the use of subjects sampled from online panels or similar platforms, sometimes labeled “eLancing” (Aguinis & Lawal, 2012; see also Cheung et al., 2017; Keith et al., 2017). Converging evidence suggests that, contrary to initial concerns, online samples can be useful, as they tend to provide population parameter estimates similar to those from more traditional samples (Walter et al., 2019). More generally, every type of sample is characterized by particular strengths and weaknesses (see Landers & Behrend, 2015), and researchers should carefully weigh those when deciding on potential sampling strategies. In addition to the type of sample, sample size is a “classic” issue in business and management research (Aguinis & Vandenberg, 2014). Briefly, obtaining a sample of sufficient size is important in order to maintain adequate power to detect the effects of interest (Aguinis & Vandenberg, 2014; Crano et al., 2015), but also to ensure that randomization is effective, based on the assumption of comparability among subjects (Lonati et al., 2018). Moreover, some advanced techniques for data analysis require relatively large samples to converge on trustworthy estimates.
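The link between sample size and statistical power can be illustrated with a standard normal-approximation power calculation for comparing two group means (a simplified sketch; dedicated power software would use the noncentral t distribution):

```python
import math
from statistics import NormalDist

def power_two_group(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two group means,
    for a standardized effect size d (Cohen's d), using the normal
    approximation to the sampling distribution of the mean difference."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)              # critical value, e.g., 1.96
    noncentrality = abs(d) * math.sqrt(n_per_group / 2)
    return z.cdf(noncentrality - z_crit)

# A medium effect (d = .50) with 64 subjects per group yields roughly
# the conventional .80 power benchmark
power = power_two_group(d=0.50, n_per_group=64)
```

Halving the sample size in this sketch drops power well below .80, which illustrates why underpowered designs so often fail to detect effects that are actually present.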

In order to attain strong construct validity, any measures and materials to be used in a study must be carefully developed and tested. While this principle applies to measures in any type of research design (e.g., Howard, 2018; Robinson, 2018; Wright et al., 2017), experimental designs add the requirement of using valid manipulations of the causal variables, which can take a variety of forms (Crano et al., 2015), including the recently more common vignette studies (Aguinis & Bradley, 2014; Lonati et al., 2018). As with selecting a suitable sample, researchers must weigh whether their manipulation(s) can be effectively and validly conveyed through vignettes. Some phenomena may not be amenable to such a “hypothetical” representation (Aguinis & Bradley, 2014), requiring the more concrete approach used in traditional laboratory experiments, potentially involving human (Crano et al., 2015) or electronic confederates (Leavitt et al., 2021). This includes real-world phenomena involving high stakes for those involved (Lonati et al., 2018).

Beyond testing those stimuli in separate samples before use in the actual research study, researchers routinely assess whether a manipulation elicited the intended effect through the use of manipulation checks (Ejelöv & Luke, 2020). These are measures of the independent variable that are included in the experimental materials, usually after the manipulation or at the end of the experiment. While checking the effect of a manipulation is laudable in principle, including manipulation checks without careful consideration can introduce unwanted side effects (Ejelöv & Luke, 2020; Lonati et al., 2018). When used before measuring the dependent variables, manipulation checks can create demand characteristics, where subjects develop expectations about the purpose of the study, which may affect their responses to the stimuli and thus bias results (Podsakoff & Podsakoff, 2019). A useful solution is to add a randomized group with the sole purpose of checking manipulations (Lonati et al., 2018).

Third, and finally, researchers can choose from a variety of techniques for analyzing data from experiments or other types of designs. The choice of technique can have important consequences for statistical conclusion validity as well as internal validity and construct validity. For experiments, the traditional models of choice have been variants of the General Linear Model, such as analysis of variance and multiple linear regression (e.g., Tabachnick & Fidell, 2007). These models may be appropriate in many designs with simple structures, but they tend to be relatively rigid as well as based on a number of assumptions that might not hold in business and management settings, such as error-free measurement. As a more general and flexible modeling framework, structural equation models (SEMs) can offer some advantages (Breitsohl, 2019). The flexibility of SEMs allows researchers to specify a vast range of models including those common in experimental (Breitsohl, 2019), quasi-experimental (Reichardt, 2019), and longitudinal designs (Newsom, 2015). SEMs can also help strengthen construct validity by explicitly modeling measurement structures (e.g., multi-item scales used in questionnaires), including measurement error. This, in turn, increases statistical power, and thus improves statistical conclusion validity (Aguinis & Vandenberg, 2014). Yet, it is worth noting that due to the relatively large number of estimated parameters, SEMs tend to require relatively large sample sizes.
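One reason explicitly modeling measurement error matters is the classic attenuation formula: a correlation estimated from error-laden measures shrinks toward zero by the square root of the product of the two scales' reliabilities. A brief sketch with hypothetical values:

```python
import math

def attenuated_correlation(r_true, reliability_x, reliability_y):
    """Classic attenuation formula: the correlation observable with
    error-laden measures, given the true correlation and each scale's
    reliability. SEMs recover r_true by modeling the error explicitly."""
    return r_true * math.sqrt(reliability_x * reliability_y)

# A true correlation of .50 measured with two scales of reliability .70
observed = attenuated_correlation(0.50, 0.70, 0.70)   # -> .35
```

A true effect of .50 thus appears as .35 in the observed data, which is why correcting for unreliability, as SEMs do, raises statistical power for a given sample size.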

For natural experiments and quasi-experiments, relatively specific statistical models are available, depending on the design, including linear regression, instrumental variables, matching algorithms, and multilevel models (see Reichardt, 2019; Sieweke & Santoni, 2020). Perhaps the largest selection is available for the vast variety of non-experimental designs. Focusing on longitudinal designs, commonly applied models include multilevel models as well as different types of models from the structural equation modeling framework (Ketokivi & McIntosh, 2017; Liu et al., 2016; Newsom, 2015; Ployhart & Vandenberg, 2010). Yet, common to all designs and statistical models, complete and transparent reporting of the methods used is key to conducting credible and trustworthy research (Aguinis et al., 2018).

Further Reading