Show Summary Details

Page of

Printed from Oxford Research Encyclopedias, Economics and Finance. Under the terms of the licence agreement, an individual user may print out a single article for personal use (for details see Privacy Policy and Legal Notice).

date: 26 January 2021

# Machine Learning in Policy Evaluation: New Tools for Causal Inference

• Noémi KreifNoémi KreifCentre for Health Economics, University of York
•  and Karla DiazOrdazKarla DiazOrdazDepartment of Medical Statistics, London School of Hygiene and Tropical Medicine

### Summary

While machine learning (ML) methods have received a lot of attention in recent years, these methods are primarily for prediction. Empirical researchers conducting policy evaluations are, on the other hand, preoccupied with causal problems, trying to answer counterfactual questions: what would have happened in the absence of a policy? Because these counterfactuals can never be directly observed (described as the “fundamental problem of causal inference”) prediction tools from the ML literature cannot be readily used for causal inference. In the last decade, major innovations have taken place incorporating supervised ML tools into estimators for causal parameters such as the average treatment effect (ATE). This holds the promise of attenuating model misspecification issues, and increasing of transparency in model selection. One particularly mature strand of the literature include approaches that incorporate supervised ML approaches in the estimation of the ATE of a binary treatment, under the unconfoundedness and positivity assumptions (also known as exchangeability and overlap assumptions).

This article begins by reviewing popular supervised machine learning algorithms, including trees-based methods and the lasso, as well as ensembles, with a focus on the Super Learner. Then, some specific uses of machine learning for treatment effect estimation are introduced and illustrated, namely (1) to create balance among treated and control groups, (2) to estimate so-called nuisance models (e.g., the propensity score, or conditional expectations of the outcome) in semi-parametric estimators that target causal parameters (e.g., targeted maximum likelihood estimation or the double ML estimator), and (3) the use of machine learning for variable selection in situations with a high number of covariates.

Since there is no universal best estimator, whether parametric or data-adaptive, it is best practice to incorporate a semi-automated approach than can select the models best supported by the observed data, thus attenuating the reliance on subjective choices.

### Overview

Most scientific questions, such as those asked when evaluating policies, are causal in nature, even if they are not specifically framed as such. Causal inference reasoning helps clarify the scientific question and define the corresponding causal estimand, that is, the quantity of interest, such as the average treatment effect (ATE). It also makes clear the assumptions necessary to express the estimand in terms of the observed data, known as identification. Once this is achieved, the focus shifts to estimation and inference. While machine learning methods have received a lot of attention in recent years, these methods are primarily geared for prediction. There are many excellent texts covering machine learning focused on prediction (Friedman, Hastie, & Tibshirani, 2001; James, Witten, Hastie, & Tibshirani, 2013), but not dealing with causal problems. Recently, some authors within the economics community have started examining the usefulness of machine learning for the causal questions that are typically the subject of applied econometric research (Athey, 2017a, 2017b; Athey & Imbens, 2017; Kleinberg, Ludwig, Mullainathan, & Obermeyer, 2015; Mullainathan & Spiess, 2017; Varian, 2014).

In this article, we contribute to this literature by providing an overview and an illustration of machine learning (ML) methods for causal inference, with a view to answering typical causal questions in policy evaluation, and show how these can be implemented with widely used statistical packages. We draw on innovations from a wide range of quantitative social and health sciences, including economics, biostatistics, and political science.

We focus on methods to estimate the ATE of a binary treatment (or exposure), under “no unobserved confounding assumptions” (see “Estimands and Assumptions for Identification”). The remainder of this article is structured as follows. First, we introduce the “Notation and Assumptions” for the identification of causal effects. Then we outline our “Illustrative Example” of a treatment effect estimation problem, an impact evaluation of a social health insurance program in Indonesia. Next, we provide a brief “Introduction to Machine Learning for Causal Inference.” In the following sections, we review methods for estimating the ATE. These can be roughly categorized into three main types: methods that aim to balance covariate distributions (“Propensity Score” [PS] and other matching methods), methods that fit outcome regressions to “impute” potential outcomes and estimate causal effects, and the so called “double-robust” methods that combine these. We also discuss the use of ML for “Variable Selection,” a challenge increasingly important with “big data,” especially with a large number of variables. In the last sections we provide a brief overview, “Further Topics” of developments for other settings and a “Discussion.”

### Estimands and Assumptions for Identification

#### Notation and Assumptions

Let A be an indicator variable for treatment and Y be the outcome of interest. Denote by $Yia$ the potential outcome that would manifest if the i-th subject were exposed to level a of the treatment, with $a∈{0,1}$. The observed outcome can then be written as $Yi=Yi0(1−Ai)+Yi1Ai$ (Rubin, 1978).

Throughout, we assume the stable unit treatment value assumption (SUTVA) holds, which comprises no interference, that is, the potential outcomes of the i-th individual are unrelated to the treatment status of all other individuals, and consistency, that is, for all individuals $i=1,…,N$, if $Ai=a$ then $Yia=Yi$, for all a (Cole & Frangakis, 2009; Pearl, 2010; Robins, Hernan, & Brumback, 2000; VanderWeele, 2009).

Denote the observed data of each individual by $Oi=(Xi,Ai,Yi)$, where $Xi$ is a vector of confounding variables, that is, factors that influence simultaneously the potential outcomes and treatment. We assume that the data are an independent identically distributed sample of size N. Individual-level causal effects are defined as the difference between these “potential outcomes.” Researchers are often interested in the average of these individual causal effects over some population. A widely considered casual estimand is the ATE, defined as $ψ=E[Yi1−Y10]$. Further estimands include the average taken over the treated subjects (the average treatment effect on the treated, ATT) or the conditional average treatment effect (CATE), which takes the expectation over individuals with certain observed characteristics. Here we focus on the ATE.

Since the potential outcomes can never be directly observed simultaneously, these estimands cannot be expressed in terms of observed data, or identified, without further assumptions. A commonly invoked assumption which we will make throughout is ignorability or unconfoundedness of the treatment assignment (also known as conditional exchangeability). This assumption requires that the potential outcomes are independent of treatment, conditional on the observed covariates,

$Display mathematics$(1)

The plausibility of this assumption needs to be carefully argued in each case, ideally with careful data collection and based on subject matter knowledge about the variables that may be associated with the outcome as well as influencing the treatment, as it cannot be tested using the observed data (Rubin, 2005).

The second necessary assumption is the positivity of the treatment assignment (also referred as “overlap”):

$Display mathematics$(2)

implying that for any combination of covariates, there is a nonzero probability of receiving both the treatment and control states.

Using the unconfoundedness and the positivity assumptions, the conditional mean of the potential outcomes corresponds with the conditional mean of the observed outcomes,

$E[Yi1|Xi,Ai=1]=E[Yi|Xi,Ai=1]$ and $E[Yi0|Xi,Ai=0]=E[Yi|Xi,Ai=0]$, and the ATE can be identified by:

$Display mathematics$(3)

### Illustrative Example: The Impact of Health Insurance on Assisted Birth in Indonesia

We illustrate the methods by applying them each in turn to an impact evaluation of a national health insurance program in Indonesia (more details in Kreif et al., 2018). The data set consists of births between 2002 and 2014, extracted from the Indonesian Family Life Survey (IFLS). The policy of interest, that is, the treatment, is “being covered by the health insurance offered for those in formal employment and their families” (contributory health insurance). We are interested in the ATE of such health insurance on the probability of the birth's being assisted by a healthcare professional (physician or midwife). We construct a list of observed covariates, including the mother’s characteristics (age, education, wealth in quintiles) and household’s characteristics (social assistance, experienced a natural disaster, rurality, availability of health services—a village midwife, birth clinic, hospital).

We expect that the variables describing socioeconomic status may be particularly important, because those with contributory insurance tend to work in the formal sector and have higher education than those uninsured, and these characteristics would make a mother more likely to use health services even in the absence of health insurance. Similarly, the availability of health services is expected to be an important confounder, as those who have health insurance may live in areas where it is easier to access healthcare, with or without health insurance.

The final data set reflects typical characteristics of survey data: the majority of variables are binary, with two variables categorical and one continuous (altogether 34 variables). Due to the nature of the survey, for around one third of women we could not measure confounder information from the past but had to impute it with information from the time of the survey. Two binary variables indicate imputed observations.

For simplicity, any records with any other missing data have been listwise deleted. This approach provides unbiased estimates of the ATE as long as missingness does not depend on both the treatment and the outcome (Bartlett, Harel, & Carpenter, 2015). The resulting complete-case data set consists of 10,985 births, of which 1,181 are in the treated group, as the mother had health insurance in the year of the child’s birth, while 8,574 babies had their birth assisted by a health professional.

### Introduction to Machine Learning for Causal Inference

#### Supervised Machine Learning

The type of ML tools most useful for causal inference are those labeled “supervised machine learning.” These tools, similarly to regression, can summarize linear and nonlinear relationships in the data and can predict some Y variable given new values of covariates (A, X) (Varian, 2014). A “good” prediction is defined in relation to a loss function, for example, the mean sum of squared errors. A commonly used measure of this is the test mean squared error (test MSE), defined as the average square prediction error among observations not previously seen. This quantity differs from the usual MSE calculated among observations that were used to fit the model. In the absence of a very large data set that can be used to directly estimate the test MSE, it can be estimated by holding out a subset of the observations, from the model fitting process, using the so-called “V-fold cross-validation” procedure (see, e.g., Zhang, 1993). When performing V-fold cross-validation, the researcher randomly divides the set of observations into V groups (folds). The first group is withheld from the fitting process, and thus is referred to as the test data. The algorithm is then fitted using the data in the remaining V − 1 folds, called the training data. Finally, the MSE is calculated using the test data, thus evaluating the performance of the algorithm. This process is repeated for each fold, resulting in V estimates of the test MSE, which are then averaged to obtain the so-called cross-validated MSE. In principle, it is possible to perform cross-validation with just one split of the sample, though results are highly dependent on the sample split. Thus, typically, V = 5 or V = 10 is used in practice.

The ultimate goal of ML algorithms is to get good out-of-sample predictions minimizing the test MSE (Varian, 2014), typically achieved by a combination of flexibility and simplicity, often described as the “bias—variance trade-off.” The formula for the MSE shows why minimizing it achieves this: $MSE(θ^)=Var(θ^)+Bias(θ^,θ)2$. For example, a nonlinear model with many higher-order terms is more likely to fit the data better than a simpler model; however, it is unlikely that it will fit a new data set similarly well, often referred to as “overfitting.”

Regularization is a general approach that aims to achieve balance between flexibility and complexity, by penalizing more complex models. With less regularization, one does a better job at approximating the within-sample variation, but for this very reason, the out-of-sample fit will typically get worse, increasing the bias. The key question is choosing the level of regularization: how to tune the algorithm so that it does not over-smooth or overfit.

In the context of selecting regularization parameters, cross-validation can be used to perform so-called empirical tuning to find the optimal level of complexity, where complexity is often indexed by the tuning parameters. The potential range of tuning parameters is divided into a grid (e.g., ten possible values), and V-fold cross-validation process is performed for each parameter value, enabling the researcher to choose the value of the tuning parameter with the lowest test MSE. Finally, the algorithm with the selected tuning parameter is refitted using all observations.

#### Prediction Versus Causal Inference

ML is naturally suited to prediction problems, which have been traditionally treated as distinct from causal questions (Mullainathan & Spiess, 2017). It may be tempting to interpret causally the output of the machine learning predictions, however, making inferences from ML models is complicated by (1) the lack of interpretable coefficients for some of the algorithms, and (2) the lack of standard errors (Athey, 2017a). Moreover, for certain “regression-like” algorithms (e.g., Lasso), selecting the best model using cross-validation, and then doing inference for the model parameters, ignoring the selection process, though common in practice, should be avoided as it leads to potential biases stemming from the shrunk coefficients and underestimation of the true variance in the parameter estimates (Mullainathan & Spiess, 2017).

The causal inference literature (see, e.g., Kennedy, 2016; Petersen, 2014; van der Laan & Rose, 2011) stresses the importance of first defining the causal estimand of interest (also referred to as “target parameter”), and then carefully thinking about the necessary assumptions for identification. Once the causal estimand has been mapped to an estimator (a function of the observed data), via the identification assumptions, the problem becomes an estimation exercise. In practice, many estimators involve models for parameters (e.g., conditional distributions, means), which are not of interest per se, but are necessary to estimate the target parameter; these are called nuisance models. Nuisance models estimation can be thought of as a prediction problem for which ML can be used. Examples include the estimation of propensity scores, or outcome regressions that can later be used to predict potential outcomes (see “Further Topics”). So while most ML methods cannot be readily used to infer causal effects, they can help the process. A potential key advantage of using ML for the nuisance models is that it fits and compares many alternative algorithms, by, for example, using cross-validation (while cross-validation can be used to select among parametric models as well). Selecting models based on a well-defined loss function (e.g., the cross-validated MSE) can, beyond improving model fit, benefit the overall transparency of the research process (Athey, 2017b). This is in contrast to the way that model selection is usually viewed in economics, where a model is chosen based on theory and estimated only once.

This has led to many researchers using ML for the estimation of the nuisance parameters of standard estimators (e.g., outcome regression, inverse probability weighting by the propensity score; see, e.g., Lee, Lessler, & Stuart, 2010; Westreich, Lessler, & Funk, 2010). However, the behavior of these estimators can be poor, resulting in slower convergence rates and confidence intervals that are difficult to construct (van der Vaart, 2014). In addition, the resulting estimators are irregular and the nonparametric bootstrap is in general not valid (Bickel, Götze, & van Zwet, 1997).

An increasingly popular strategy to avoid these biases and to have valid inference is to use the so-called doubly robust estimators (combining nuisance models for the outcome regressions and the propensity score), which we review in “Double-Robust Estimation With Machine Learning.” This is because DR estimators can converge at fast rates ($N$) to the true parameter, and are therefore consistently asymptotically normal, even when the nuisance models have been estimated via ML.

In the following sections, we briefly describe the ML approaches that have been most widely used for nuisance model prediction in causal inference, either because of their similarity to traditional regression approaches, their easy implementation due to the availability of statistical packages, their superior performance in prediction, or a combination of these.

#### Lasso

Lasso (least absolute shrinkage and selection operator) is a penalized linear (or generalized linear) regression algorithm, fitting the model including all d predictors. It aims to find the set of coefficients that minimize the sum-of-squares loss function, but subject to a constraint on the sum of absolute values (or `$l1$ norm) of coefficients being equal to a constant, c, often referred to as budget, that is, $∑j=id||βj||1=c$. This results in a (generalized) linear regression in which only a small number of covariates have nonzero coefficient: this absolute-value regularizer induces a sparse coefficient vector. The nonzero coefficient estimates are also shrunk toward zero. This significantly reduces their variance at the “price” of increasing the bias. An equivalent formulation of the lasso is

$Display mathematics$(4)

with the penalty λ‎ being the tuning parameter.

As λ‎ increases, the flexibility of the lasso regression fit decreases, leading to decreased variance but increased bias. Beyond a certain point, however, the decrease in variance due to increasing λ‎ slows, and the shrinkage on the coefficients causes them to be significantly underestimated, resulting in a large increase in the bias. Thus the choice of λ‎ is critical. This is usually done by cross-validation, implemented by several packages such as glmnet and caret.

Because the lasso results in some of the coefficients, being exactly zero when the penalty λ‎ is sufficiently large, it essentially performs variable selection. The variable selection, however, is driven by the tuning parameter, and it can happen that some variables are selected in some of the CV partitions but may be unused in another. This problem is common when the variables are correlated with each other, and they explain very similar “portions” of the outcome variability. A practical implication of this is that the researcher should remove from the set of potential variables those that are irrelevant, in the sense that they are very correlated to a combination of other, more relevant ones.

Another problem is inference after model selection, with some results (Leeb & Benedikt, 2008) showing it is not possible to obtain (uniform) model selection consistency. As we demonstrate in the section “Variable Selection,” some uses of lasso enable consistent estimation of post-variable selection.

#### Tree-Based Methods

##### Regression Trees

Tree-based methods, also known as classification and regression trees or “CARTs,” have a similar logic to decision trees familiar to economists, but here the “decision” is a choice about how to classify the observation. The goal is to construct (or “grow”) a decision tree that leads to good out-of-sample predictions. They can be used for classification with binary or multicategory outcomes (“classification trees”) or with continuous outcomes (“regression trees”). A regression tree uses a recursive algorithm to estimate a function describing the relationship between a multivariate set of independent variables and a single dependent variable, such as treatment assignment.

Trees tend to work well for settings with nonlinearities and interactions in the outcome–covariate relationship. In these cases, they can improve upon traditional classification algorithms like logistic regression. In order to avoid overfitting, trees are pruned by applying tuning parameters that penalize complexity (the number of leaves). A major challenge with tree methods is that they are sensitive to the initial split of the data, leading to high variance. Hence, single trees are rarely used in practice, but instead ensembles—algorithms that stack or add together different algorithms—of trees are used, such as random forest or boosted CARTs.

##### Random Forests

Random forests are constructed using bootstrapped samples of the data and then growing a tree where only a (random) subset of covariates is used for creating the splits (and thus the leaves). These trees are then averaged, which leads to a reduction in variance. The tuning parameters, which can be set or selected using cross-validation, include the number of trees, depth of each tree, and the number of covariates to be randomly selected (usually recommended to be approximately $d$ where d is the number of available independent variables). Popular implementations include the R packages caret and ranger.

##### Boosting

Boosting generates a sequence of trees where the first tree’s residuals are used as outcomes for the construction of the next tree. Generalized boosted models add together many simple functions to estimate a smooth function of a large number of covariates. Each individual simple function lacks smoothness and is a poor approximation to the function of interest, but added together, they can approximate a smooth function just like a sequence of line segments can approximate a smooth curve. In the implementation in the R package gbm (McCaffrey, Ridgeway, & Morral, 2004), each simple function is a regression tree with limited depth. Another popular package is xgboost.

Bayesian additive regression trees (BARTs) can be distinguished from other tree-based ensembling algorithms due to its underlying probability model (Kapelner & Bleich, 2016). As a Bayesian model, BART consists of a set of priors for the structure and the leaf parameters and a likelihood for data in the terminal nodes. The aim of the priors is to provide regularization, preventing any single regression tree from dominating the total fit. To do this, BARTs employ so-called “Bayesian backfitting” where the j-th tree is fit iteratively, holding all other m − 1 trees constant by exposing only the residual response that remains unfitted. Over many MCMC iterations, trees evolve to capture the fit left currently unexplained (Kapelner & Bleich, 2016).

BART is described as particularly well-suited to detecting interactions and discontinuities and typically requires little parameter tuning (Hahn, Murray, & Carvalho, 2017). There is ample evidence on BART’s good performance in predictions and even in causal inference (Dorie, Hill, Shalit, Scott, & Cervone, 2017; Hill, 2011), and it is implemented in several R packages (bartMachine, dbarts). Despite its excellent performance in practice, there are limited theoretical results about BARTs.

#### Super Learner Ensembling

Varian (2014) highlights the importance of recognizing uncertainty due to the model selection process, and the potential role ensembling can play in combining several models to create one that outperforms single models. Here we focus on the Super Learner (SL; van der Laan & Dudoit, 2003), a ML algorithm that uses cross-validation to find the optimal weighted convex combination of multiple candidate prediction algorithms. The algorithms prespecified by the analyst form the library and can include parametric and ML approaches. The SL has the oracle property, that is, it produces predictions that are at least as good as those of the best algorithm included in the library (for details, see van der Laan, Polley, & Hubbard, 2007; van der Laan & Rose, 2011).

Beyond its use for prediction (Polley & van der Laan, 2010; Rose, 2013), it has been used for propensity score and outcome model estimation see, for example, (Eliseeva, Hubbard, & Tager, 2013; Gruber, Logan, Jarrín, Monge, & Hernán, 2015; van der Laan & Luedtke, 2014) and has been shown to reduce bias from model misspecification (Kreif, Gruber, Radice, Grieve, & Sekhon, 2016; Pirracchio, Petersen, & van der Laan, 2015; Porter, Gruber, van der Laan, & Sekhon, 2011). Implementations of the SL include the SuperLearner, h2oEnsembleR and the subsemble R, packages, the latter two with increased computational speed to suit large data sets.

### Machine Learning Methods to Create Balance Between Covariate Distributions

#### Propensity Score Methods

The propensity score (PS; Rosenbaum & Rubin, 1983), defined as the conditional probability of treatment A given observed covariates, that is, $p(xi)=P(Ai=1|Xi=xi)$, is referred to as a “balancing score,” due to its property of balancing the distributions of observed confounders among the treatment and control groups. The PS has been widely used to control for confounding, either for subclassification (Rosenbaum & Rubin, 1984), as a metric to establish matched pairs in nearest neighbor matching (Abadie & Imbens, 2016; Rubin & Thomas, 1996), and for reweighting, using inverse probability of treatment weights (Hirano, Imbens, & Ridder, 2003). The latter two approaches have been demonstrated to have the best performance (Austin, 2009; Lunceford & Davidian, 2004).

The PS matching estimator constructs the missing potential outcome using the observed outcome of the closest observation(s) from the other group, and calculates the ATE as a simple mean difference between these predicted potential outcomes (Abadie & Imbens, 2006, 2011, 2016). The inverse probability of treatment weighting (IPTW) estimator for the ATE is simply a weighted mean difference between the observed outcomes of the treatment and control groups, where the weights, $wi$, are constructed from the estimated propensity score as

$Display mathematics$(5)

With a correctly specified $p(X)$, $ψIPTW$ is consistent and efficient (Hirano et al., 2003). The IPTW estimator can be expressed as

$Display mathematics$(6)

Obtaining SEs for IPTW estimators can be done by the Delta method, assuming the PS is known, or using a robust covariance matrix, so-called sandwich estimator, to acknowledge that the PS was estimated, or by bootstrapping. IPTW estimators are sensitive to large weights.

The validity of these methods depends on correctly specifying the PS model. In empirical work, typically probit or logistic regression models are used without interactions or higher-order terms. However, the assumptions necessary for these to be correctly specified, for example, the linearity of the relationship between covariates and probability of treatment in the logit scale, are rarely assessed (Westreich et al., 2010). More flexible modeling approaches, such as series regression estimation (Hirano et al., 2003), and ML methods, including decision trees, neural networks, and linear classifiers (Westreich et al., 2010), generalized boosting methods (Lee et al., 2010; McCaffrey et al., 2004; Westreich et al., 2010; Wyss et al., 2014) or the SL (Pirracchio et al., 2012) have been proposed to improve the specification of the PS. However, even such methods may have poor properties, if their loss function targets measures of model fit (e.g., log likelihood, area under the curve) instead of balancing covariates that are important to reduce bias (Westreich, Cole, Funk, Brookhart, & Stürmer, 2011). Imai and Ratkovic (2014) proposed a score that explicitly balances the covariates, exploiting moment conditions that capture the desired mean independence between the treatment variable and the covariates that the balancing aims to achieve. A ML method for estimating PSs that aims to maximize balance is the boosted CART approach (McCaffrey et al., 2004; Lee et al., 2010), implemented as the TWANG R package. This approach minimizes a chosen loss function, based on covariate balance achieved in the IPTW weighted data, by iteratively forming a collection of simple regression tree models and adding them together to estimate the PS. It models directly the log-odds of treatment rather than the PSs, to simplify computations. The algorithm can be specified to stop when the best balance is achieved. A recommended stopping rule is the average standardized absolute mean difference (ASAM) in the covariates. Beyond the balance metric, the number of iterations, depth of interactions, and shrinkage parameters need to be specified. The boosted CART approach to estimating PS has been demonstrated to improve balance and reduce bias in the estimated ATE (Lee et al., 2010; Setoguchi, Schneeweiss, Brookhart, Glynn, & Cook, 2008) and has been extended to settings with continuous treatments (Zhu, Coffman, & Ghosh, 2015).

#### Methods Aiming to Directly Create Balanced Samples

There is an extensive literature on methods that aim to create matched samples that are automatically balanced on the covariates, instead of estimating and matching on a PS. An extension of Mahalanobis distance matching, the Genetic Matching algorithm (Diamond & Sekhon, 2013) searches a large space of potential matched treatment and control groups to minimize loss functions based on test statistics describing covariate imbalance (e.g., Kolmogorov–Smirnov tests). The accompanying Matching R package (Mebane & Sekhon, 2011; Sekhon, 2011) has a wide range of matching methods (including PS), matching options (e.g., with or without replacement, 1:1 or 1:m matching), estimands (ATE vs ATT) and balance statistics. The “genetic” component of the matching algorithm chooses weights to give relative importance to the matching covariates to optimize the specified loss function. The algorithm proposes batches of weights, “a generation,” and moves toward the batch of weights that maximize overall balance. Each generation is then used iteratively to produce a subsequent generation with better candidate weights. The “population size”—the size of each generation—is the tuning parameter to be specified by the user.

Similar approaches to creating optimal matched samples with a different algorithmic solution are offered by Zubizarreta (2012) and Hainmueller (2012). Both approaches use integer programming optimization algorithms to construct comparison groups given balance constraints (maximum allowed imbalance) specified by the user, in the former case by one-to-many matching, in the latter case by constructing optimal weights.

#### Demonstration of Balancing Methods Using the Birth Data Set

We estimate a range of PS: first, using a main terms logistic regression to estimate the conditional probability of being enrolled in health insurance, followed by two data-adaptive PSs. We include all covariates in the prediction algorithms, without prior covariate selection. The first is a boosted CART, with 5,000 trees, of a maximum depth of two interactions, and shrinkage of 0.005. The loss function used is the “average standardized difference.”

Second, we use the SL with a library containing a range of increasingly data-adaptive prediction algorithms:

logistic regression with and without all pairwise interaction,

generalized additive models with two, three, and four degrees of freedom,

random forests–including four random forest learners varying the number of trees (500, 2,000), and the number of covariates to split on (five and eight), implemented in the ranger R package),

boosting–using the R package xgboost, with varying number of trees (100 and 1,000), shrinkage (0.001 and 0.1) and maximum tree depth (one and four),

a BART prediction algorithm using 200 regression trees with the tuning parameters set to default implementation in the darts R package.

We use 10-fold cross-validation and the mean sum of squares loss function. For the purposes of comparison, we have also implemented two 1:1 matching estimators with replacement, for the ATE parameter. First, we created a matched data set based on the boosted CART PS, implemented without calipers. Second, we implemented the Genetic Matching algorithm, using a population size of 500, and a loss function that aims to minimize the largest p-values from paired t-test. We have reassessed the balance for the pair-matched data. Throughout, we evaluate balance based on standardized mean differences, a metric that is comparable across weighting and matching methods (Austin, 2009). We calculate the ATE using IPTW and matching. The SEs for the IPTW are “sandwich” SEs, while for the matching estimators the Abadie–Imbens formula is used (Abadie & Imbens, 2006) that accounts for matching with replacement.

Figure 1 can be inspected to assess the relative performance of the candidate algorithms included in the SL. It displays the cross-validated estimates of the loss function (MSE) after an additional layer of cross-validation, so that the out-of-sample performance of each individual algorithm and the convex combination of these algorithms (SL) can be compared. The “Discrete SL” is defined as an algorithm that gives the best candidate the weight of 1. We see that the convex SL performs best. Table 1 shows the coefficients attributed to the different candidate algorithms in the final prediction algorithm that was used to estimate the PS.

#### Table 1: Nonzero Weights Corresponding to the Algorithms in the Convex Super Learner, for Estimating Propensity Scores

Algorithm’s Weight in Ensemble

Random Forest (500 trees, 5 variables)

0.18

Random Forest (2,000 trees, 5 variables)

0.36

0.46

GLM with all two-way interactions

0.01

For each PS and matching approach, we compare balance on the covariates in the data (reweighted by IPTW weights or by frequency weights from the matching, respectively).

Algorithms labeled by Rg are variations of random forests, algorithms labeled by xgb are variations of the boosting algorithm. Algorithms labeled with SL are implemented in the SuperLearner R package.

Figure 2 displays the absolute standardized differences (ASD) for all the covariates, starting from the variables that were least imbalanced in the unweighted data, moving toward the more imbalanced. Generally, all weighting approaches tend to improve balance compared to the unweighted data, except for variables that were well balanced (ASD < 0.05) to begin with. Using the rule of thumb of 0.1 as a metric of significant imbalance, we find that TWANG, when used for weighting, achieves acceptable balance on all covariates, except for the binary variable indicating having at least secondary education. Based on the more stringent criterion of ASD < 0.05, however, TWANG leaves several covariates imbalanced, including the indicator of rural community, the availability of health center, the availability of birth clinic, and whether the mother can write in Indonesian. When used for pair-matching, the boosted CART-based PS leaves high imbalances. This is expected, as the balance metric in the loss function used the weighted, not the matched, data. The SL-based PS results in the largest imbalance, again reflecting that the loss function was set to maximize the cross-validated MSE of the PS model, and not to maximize balance.1

The estimated ATE results in a 10% increase in the probability of giving birth attended by a health professional, among those with contributory insurance (vs. those without; see Table 2). With all the adjustment methods, this effect decreases, indicating an important role of adjusting for observed confounders. As expected, the method that reported the largest imbalances, IPTW SL, reports an ATE closest to the unadjusted estimate.

#### Limitations of Balancing Methods

Balancing methods allow for the consistent estimation of treatment effects, provided the assumptions of no unobserved confounding and positivity hold. Crucially, the analyst does not need to model outcomes, thus increasing transparency by avoiding cherry-picking of models. While ML can help by making the choices of the PS model or a distance metric data-adaptive, subjective choices remain. For example, the loss function needs to be specified, and if the loss function is based on balance, so does the choice of balance metric. The ASM chosen for demonstration purposes creates a metric of average imbalance for the means. This ignores two potential complexities. First, imbalances in higher moments of the distribution are not taken into account by this metric. While Kolmogorov–Smirnov test statistics can take into account imbalance in the entire covariate distribution (Diamond & Sekhon, 2013; Stuart, 2010) and can be selected to enter the loss function for both the boosted CART and the Genetic Matching algorithms, there is a further issue remaining: how should the researcher trade off imbalance across covariates? With a large number of covariates, balancing one variable may decrease balance on another covariate. Moreover, it is unclear how univariate balance measures should be summarized. The default of the TWANG package is to look at average covariate balance, while the Genetic Matching algorithm, as default, prioritizes covariate balance on the variable that is the most imbalanced. This, however, may not prioritize variables that are relatively strong predictors of the outcome, and any remaining imbalance would translate into a larger bias. Hence, there is an increasing consensus that exploiting information from the outcome regression generally improves on the properties of balance-based estimators (Abadie & Imbens, 2011; Kang & Schafer, 2007). ML methods to estimate the nuisance models for the outcome can provide reassurance against subjectively selecting outcome models that provide the most favorable treatment effect estimate. We review these methods in the following section.

#### Table 2: ATEs and 95% CIs Estimated Using IPTW and Matching Methods

ATE

95% CI L

95% CI U

0.13

0.11

0.16

IPTW logistic

0.06

0.02

0.11

IPTW TWANG

0.08

0.04

0.12

IPTW SL

0.11

0.09

0.13

PS matching TWANG

0.08

0.04

0.12

Genetic Matching

0.06

0.03

0.09

#### Machine Learning Methods for the Outcome Model

Recall that under the unconfoundedness and positivity assumptions, the ATE can be identified by $E[Y|A=1,X]−E[Y|A=0,X]$, reducing the problem to one of estimation of these conditional expectations (Hill, 2011; Imbens & Wooldridge, 2009). Denoting the true conditional expectation function for the observed outcome as $μ(A,X)=E[Y|A,X]$, the regression estimator for the ATE can be obtained as

$Display mathematics$(7)

where $μ^(A=a,X=x)$ can be interpreted as the predicted potential outcome for level a of the treatment among individuals with covariates $X=x$, and can be estimated by, for example, a regression $E[Y|A=a,X=x]=η0(x)+β1a$; with $η0(x)$ the nuisance function and $β1$ the parameter of interest for level $a$ of the treatment. Under correct specification of the models for $μ(A,X)$, the outcome regression estimator is consistent (Bang & Robins, 2005), but it is prone to extrapolation.

The problem can now be viewed as a prediction problem, making it appealing to use ML to obtain good predictions for $μ(A,X)$. Indeed, some methods do this: BARTs have been successfully used to obtain ATEs (Hahn et al., 2017; Hill, 2011). Austin (2012) demonstrates the use of a wide range of tree-based ML techniques to obtain regression estimators for the ATE.

However, there are three reasons why ML is generally not recommended for outcome regression. First, the asymptotic properties of such estimators are unknown. Typically, the convergence of the resulting regression estimator for the causal effect will be slower than $N$ when using ML fits. A related problem is the so-called “regularization bias” (Athey, Imbens, & Wager, 2018; Chernozhukov et al., 2017). Data-adaptive methods use regularization to achieve optimal bias-variance trade-off, which shrinks the estimates toward zero, introducing bias (Mullainathan & Spiess, 2017), especially if the shrunk coefficients correspond to variables that are strong confounders (Athey et al., 2018). This problem increases as the number of parameters compared to sample size grows. Third, it is difficult to conduct inference for causal parameters, as in general there is no way of constructing valid confidence intervals, and the nonparametric bootstrap is not generally valid (Bickel et al., 1997).

This motivates going beyond single nuisance model ML plug-in estimators, and using double-robust estimators with ML nuisance model fits, reviewed in the next section (Athey et al., 2018; Chernozhukov et al., 2017; Farrell, 2015; Seaman & Vansteelandt, 2018).

#### Double-Robust Estimation With Machine Learning

Methods that combine the strengths of outcome regression modeling with the balancing properties of the PS have been long advocated. The intuition is that using PS matching or weighting as a “preprocessing” step can be followed by regression adjustment to control further for any residual confounding (Abadie & Imbens, 2006, 2011; Imbens & Wooldridge, 2009; Rubin & Thomas, 2000; Stuart, 2010). While these methods have performed well in simulations (Busso, DiNardo, & McCrary, 2014; Kreif, Grieve, Radice, & Sekhon, 2013; Kreif et al., 2016), their asymptotic properties are not well understood.

A formal approach to combining outcome and treatment modeling was originally developed to improve the efficiency of IPTW estimators (Robins, Rotnitzky, & Zhao, 1995). Double-robust (DR) estimators use two nuisance models and have the special property that they are consistent as long as at least one of the two nuisance models is correctly specified. In addition, some DR estimators are shown to be semiparametrically efficient, if both components are correctly specified (Robins, Sued, Lei-Gomez, & Rotnitzky, 2007). A simple DR method is the augmented inverse probability of treatment weighting (AIPTW) estimator (Robins, Rotnitzky, & Zhao, 1994). The AIPTW can be written as $ψAIPTW=ψAIPTW(1)−ψAIPTW(0)$, where

$Display mathematics$(8)

where $μ(A,X)=E[Y|A,X]$, as before.

The variance of DR estimators is based on the variance of their influence function. Let $ψ$ be an estimator of a scalar parameter $ψ0$, satisfying

$Display mathematics$(9)

where o(1) denotes a term that converges in probability to 0, and where $E[ϕ(O)]=0$ and $0, that is, ϕ‎(O) has zero mean and finite variance. Then ϕ‎(O) is the influence function (IF) of $ψ^$.

By the central limit theorem, the estimator $ψ^$ is asymptotically normal with asymptotic variance N1 times the variance of its IF. Thus, the empirical variance of the IF can be used to construct normal-based confidence intervals.

A consequence of this convergence behavior is that good asymptotic properties of DR estimators can be achieved even when the convergence rates of the nuisance models are slower than the conventional $N$, as the DR estimator $ψ^$ can still converge at a fast $N$ rate, as long as the product of the nuisance models’ convergence rates is faster than $N$ (under regularity conditions, and empirical process conditions—e.g., Donsker class, which can be avoided via sample splitting, described later; Bickel & Kwon, 2001; Chernozhukov et al., 2017; van der Laan & Robins, 2003; van der Laan & Robins, 2003).

This discovery allows for the use of flexible ML-based estimation of the nuisance functions, leading to an increased applicability of DR estimators, which were previously criticized, given that most likely both nuisance models are misspecified (Kang et al., 2007). Concerns about the sensitivity to extreme PS weights remain (Petersen, Porter, Gruber, Wang, & van der Laan, 2012).

To improve on AIPTW estimators, van Der Laan and Rubin (2006) introduced targeted minimum loss based estimation (TMLE), a class of DR semiparametric efficient estimators. TMLEs “target” or debias an initial estimate of the parameter of interest in two stages (Gruber & van Der Laan, 2009). In the first stage, an initial estimate $μ0(A,X)$ of $E[Y|A,X]$ is obtained (typically by ML) and is used to predict potential outcomes under both exposures, for each individual (van der Laan & Rose, 2011).

In the second stage, these initial estimates are “updated,” by fitting a generalized linear model for $E(Y|X)$, typically with logit link, an offset term logit ${μ0(A,X)}$, and a single so-called clever covariate. When the outcome is continuous but bounded, the update can also be performed on the logit scale (Gruber & van der Laan, 2010). For the ATE, the clever covariate is $h(A,k)=Ap(x)−(1−A)(1−p(x))$. The coefficient ε‎, corresponding to the clever covariate, is then used to update (debias) the estimate of $μ0(A,X)$. The updating procedure continues until a step is reached where $ε=0$. The final update $μ*(A,X)$ is the TMLE. For the special case of the ATE, convergence is mathematically guaranteed in one step, so there is no need to iterate.

This procedure exploits the information in the treatment assignment mechanism and also ensures that the resulting estimator stays in the appropriate model space, that is, it is a substitution estimator. Again, data-adaptive estimation of the PS is recommended (van der Laan & Rose, 2011). Available software implementation of TMLE (R package TMLE; Gruber & van der Laan, 2011) incorporates a SL algorithm to provide the initial predictions of the potential outcomes and the PSs.

Another DR estimator with ML is the so-called double machine learning (DML) estimator (Chernozhukov et al., 2017). For a simple setting of the ATE, this estimator simply combines the residuals of an outcome regression and the residuals of a PS model, into a new regression, motivated by the partially linear regression approach of Robinson (1988). For the more general case when the treatment can have an interactive effect with covariates, the form of the estimator corresponds to the AIPTW estimator, where the nuisance parameters are estimated using ML algorithms. While the estimator does not claim “double-robustness” (as it does not aim to “correctly specify” any of the models), it aims to debias estimates of the ATEs by combining “good enough” estimates of the nuisance parameters. The ML methods used can be highly data-adaptive. This estimator is also semiparametric efficient, under weak conditions, due to an extra step of sample splitting (thus avoiding empirical process conditions; Bickel & Kwon, 2001). The estimator is constructed using “cross-fitting,” which divides the data into K random splits and withholds one part of data from fitting the nuisance parameters, while using the rest of the data to obtain predictions and constructing the ATE. This is then repeated K times, and the average of the resulting estimates is the DML estimate for the ATE. The standard errors are based on the IF (Chernozhukov et al., 2017). Sample splitting is designed to help avoid overfitting and thus reduces the bias. A further adaptation of the method also takes into account the uncertainty about the particular sample splitting, by doing large number of re-partitionings of the data, and taking the mean or median of the resulting estimates as the final ATE, and also correcting the estimated standard errors to capture the spread of the estimates.

##### Demonstration of DR and Double Machine Learning Approaches Using the Birth Data

We begin by fitting a parametric AIPTW using logistic models for both PS and outcome regression. SEs are estimated by nonparametric bootstrap. We then use SL fits for both nuisance models (with the libraries described in “Demonstration of Balancing Methods Using the Birth Data Set”). For the outcome models, we fit two separate prediction models, for the treated and control observations, and obtain the predictions for the two expected potential outcomes, the probabilities of assisted birth under no health insurance and health insurance, given the individual’s observed covariates. We plug these predictions into the standard AIPTW. The SEs are based on the IF (without further modification). Next, we implement the TMLE using the same nuisance model SL fits. SEs are based on the efficient IF, as coded in the R package tmle. Finally, the DML estimates for the ATE are obtained using one split of approximately equal size. The nuisance models are re-estimated in the first split of the sample using the SL with the same libraries as before, and obtaining predictions for the other half of the split sample. We then (in the cross-fitting step) switch the roles of the two samples, and average the resulting estimates with the formulae for SEs based on the influence function, as in Chernozhukov et al., 2017.

The relative weights of the candidate algorithms in the SL library are displayed in Table 3, showing that the highly data-adaptive algorithms (boosting, random forests, and BART) received the majority of the weights. The estimated ATEs with 95% CIs are reported in Table 4. While the point estimates are of similar magnitude (a 5%–8% increase in the probability of assisted birth), their confidence intervals show a large variation. The SL AIPTW and TMLE appear to be very precisely estimated, displaying a narrower CI than the parametric AIPTW, where CIs have been obtained using parametric bootstrap. One potential explanation may be that without the sample splitting, the nuisance parameters may be overfitted, and the IF-based SEs do not take this into account. This is consistent with the finding of Dorie et al. (2017) in simulated data sets that TMLE results in undercoverage of 95% CIs. Indeed, the DML estimator, which only differs from the SL AIPTW estimator in the cross-fitting stage, displays the widest CIs, including zero.

#### Table 3: Nonzero Weights Corresponding to the Algorithms in the Convex Super Learner for Modeling the Outcome

Algorithm’s Weight in SL

Model for Control

Model for Treated

Boosting (100 trees, depth of 4, shrinkage 0.1)

0.02

0.38

Boosting (1,000 trees, depth of 4, shrinkage 0.1)

0.00

0.04

Random forest (500 trees, 5 variables)

0.00

0.36

Random forest (500 trees, 8 variables)

0.00

0.09

Random forest (2,000 trees, 8 variables)

0.32

0.00

BART

0.54

0.00

GLM with no interaction

0.00

0.11

GLM with all two-way interactions

0.12

0.01

### Variable Selection

A problem empirical researchers face when relying on conditioning on a sufficient set of observed covariates for confounding control is variable selection, that is, identifying which covariates to include in the model(s) for conditional exchangeability to hold. In principle, subject matter knowledge should be used to select a sufficient control set (Rubin, 2007). In practice, however, there is often little prior knowledge on which variables in a given data set are confounders. Hence data-adaptive procedures to select the variables to adjust for become increasingly necessary when the number of potential confounders is very large. There is a lack of clear guidance about what procedures to use, and about how to obtain valid inferences after variable selection. In this section, we consider some approaches for variable selection when the focus is on the estimation of causal effects.

#### Table 4: ATE Obtained With Logistic AIPTW, Super Learner Fits for the PS and Outcome Model AIPTW (SL AIPW), TMLE, and DML Estimators Applied to the Birth Data

ATE

95% CI

AIPTW (boot SE)

0.066

0.029

0.103

SL AIPTW

0.081

0.077

0.086

TMLE

0.073

0.065

0.081

DML (1 split)

0.053

-0.026

0.133

Decisions on whether to include a covariate in a regression model, whether these are done manually or by automated methods, such as stepwise regression, are usually based on the strength of evidence for the residual association with the outcome, by for example, iteratively testing for significance in models that include or exclude the variable, and comparing the resulting p-value to a prespecified significance level. Stepwise models (backward or forward selection) are, however, widely recognized to perform poorly (Heinze, Wallisch, & Dunkler, 2018), for two main reasons. First, collinearity can be an issue, which is especially problematic for forward selection, while in high-dimensional settings backward selection may be unfeasible. Second, tests performed during the variable selection process are not prespecified, and this is typically not acknowledged in the subsequent analysis, compromising the validity and the interpretability of subsequent inferences derived from models after variable selection.

Decisions about which covariates to adjust for in a regression must ideally be based on the evidence of confounding, taking into account the covariate–exposure association. Yet causal inference procedures that only rely on the strength of covariate–treatment relationships (e.g., propensity score methods) may also be problematic. For example, they may lead to adjusting for variables that are causes of the exposure only (so-called pure instruments), inducing bias (Vansteelandt, Bekaert, & Claeskens, 2012). On the other hand, if variable selection only relies on modeling the outcome, using for example, lasso regression, it may introduce regularization bias, due to underestimating coefficients, and as a result, mistakenly excluding variables with nonzero coefficients.

To address these challenges, Belloni, Chernozhukov, and Hansen (2014) proposed a solution that offers principled variable selection, taking into account both the covariate–outcome and the covariate–treatment assignment association, resulting in valid inferences after variable selection. Their framework, referred to as “post double selection,” or “double-lasso,” also allows to extend the space of possible confounding variables to include higher-order terms. Following Belloni et al. (2014), we consider the partially linear model $Yi=g(Xi)+β0Ai+ζi$, where $Xi$ is a set of confounder-control variables, and $ζi$ is the error term satisfying $E[ζi|Ai,Xi]=0$. We examine the problem of selecting a set of variables V from among d2 potential variables $Wi=f(X)$, which includes X and transformations of X as to adequately approximate g(X), and allowing for $d2>N$. Crucially, pure instruments—variables associated with the treatment but not the outcome—do not need to be identified and excluded in advance.

We identify covariates for inclusion in our estimators of causal effects in two steps. First, we find those that predict the outcome and then, in a separate second step, those that predict the treatment. A lasso linear regression calibrated to avoid overfitting is used for both models. In a final step, the union of the variables selected in either step is used as the confounder control set, to be used in the causal parameter estimator. The control set can include some additional covariates identified beforehand.

Belloni et al. (2014) show that the double-lasso results in valid estimates of ATE under the key assumption of ultra-sparsity, that is, conditional exchangeability holds after controlling for a relatively small number $s< of variables in X not known a priori. Implementation is straightforward, for example using the glmnet R package. We use cross-validation for choosing the tuning parameter, following Dukes, Avagyan, and Vansteelandt (2018). Once the confounder control set is selected, a standard method of estimation is used, for example ordinary least squares estimation of the outcome regression.

#### Application of Double-Lasso to the Birth Data

We now apply the double-lasso approach for variable selection to our birth data example. We begin by running a lasso linear regression (using the glmnet R package) for both outcome and treatment separately, including all the variables available and using cross-validation to select the penalty. The union of the variables selected for both models was all 35 available covariates. These variables are used to control for confounding first in a parametric logistic outcome regression model, which we use to predict the potential outcomes, and obtain the ATE. We also calculate IPTW and AIPTW estimates using the weights from a parametric logistic model for the PS. For all estimates we use bootstrap to obtain SEs.

We then increase the covariate space to include all the two-way interactions between the covariates, excluding the exposure and the outcome, resulting in a total of 595 covariates. Using double-lasso on this extended covariate space, we select 156 covariates for the outcome and 89 for the treatment model, leaving us a union set used for confounding control of 211. We repeat the three estimators now based on this expanded control set.

Table 5 reports the estimated ATEs and 95% CIs. The CIs for the double-lasso outcome regression were obtained using the nonparametric bootstrap, while the IPTW and AIPTW were obtained as before using the sandwich SEs. The top panel of the table shows estimates using all covariates but no interactions, and the bottom panel shows estimates using 72 covariates, including the main terms and the interactions. The point estimates change very little, implying a minor role of the interactions in controlling for confounding.

#### Collaborative TMLE

Covariate selection for the PS can also be done within the TMLE framework. Previously, we have seen that for a standard TMLE estimator of the ATE, the estimation of the PS model is performed independently from the estimation of the initial estimator of the outcome model, that is, without seeking to optimize the fit of the PS for the estimation of the target parameter.

#### Table 5: ATE Post-Double-Lasso for Selection of Confounders Applied to the Birth Data

ATE

95 % CI

Outcome Regression

0.027

0.012

0.041

IPTW

0.083

0.016

0.107

AIPTW

0.066

0.029

0.103

With two-way interactions

Outcome Regression

0.032

0.016

0.046

IPTW

0.078

0.019

0.107

AIPTW

0.063

0.026

0.101

However, it is possible, and even desirable, to choose the treatment model that is optimized to reduce the MSE of the target parameter. An extension of the TMLE framework, the so-called collaborative TMLE (CTMLE) does just this.

The original version of CTMLE (van Der Laan & Gruber, 2010) is often referred to as “greedy CTMLE.” Here, a sequence of nested logistic regression PS models is created by a greedy forward stepwise selection algorithm, nested such that at each stage (i.e., among all PS models with $k$ main terms), we select the PS model that results in the estimated MSE of the ATE parameter being the smallest, when used in the targeting step to update the initial estimator of the outcome model. If the resulting TMLE does not improve upon the empirical fit of the initial outcome regression, then this TMLE is not included in the sequence of TMLEs. Instead, the initial outcome regression estimator is replaced by the last previously accepted TMLE and we start over. This procedure is performed iteratively until all d covariates have been incorporated into the PS model. The greedy nature of the algorithm makes it computationally intensive: the total number of models explored is $d+(d−1)+(d−2)+…+1$, which is of the order $O(d2)$ (Ju et al., 2019a).

This has led to the development of scalable versions (Ju et al., 2019b) of CTMLE, which replace the greedy search with a data-adaptive preordering of the candidate PS model estimators. The time burden of these scalable CTMLE algorithms is of order $O(d)$. Two CTMLE preordering strategies are proposed: logistic and correlation. The logistic preordering CTMLE constructs a univariable estimator of the PS for each available covariate, $pk$ with $Xk$ the baseline variable, $k=1,…,d$. Using the resulting predicted PS, we construct the clever covariate corresponding to the TMLE for the ATE, namely $h(A,k)=Apk−(1−A)(1−pk)$, and fluctuate the initial estimate of the parameter of interest $μ0$ using this clever covariate (and a logistic log-likelihood) as usual in the TMLE literature. We obtain the targeted estimate and compute the empirical loss, which could be, for instance, the mean sum of squared errors. Finally we order the covariates by increasing value of this loss.

The correlation preordering is motivated by noting that we would like the k-th covariate added to the PS model to be that which best explains the current residual, that is, between Y and current targeted estimate. Thus, the correlation preordering ranks the covariates based on their correlation with the residual between Y and the initial outcome regression estimate $μ0$ (Ju et al., 2019a).

For both preorderings, at each step of the CTMLE, we add the variables to the PS prediction model in this order, as long as the value of the loss function continues to decrease. Another version of the CTMLE exploits lasso regression for the selection of the variables to be included in the PS estimation (Ju et al., 2019b). This CTMLE algorithm also constructs a sequence of PS estimators, each of them a lasso logistic model with a penalty $λk$, where k is monotonically decreasing, and which is “initialized” with $λ1$ the minimum $λ$ selected by cross-validation. Then, the corresponding TMLE estimator for the ATE is constructed for each PS model, finally choosing by cross-validation the TMLE that minimizes the loss function (MSE). The SEs for all CTMLE versions are computed based on the variance of the IF, as implemented in the R package ctmle.

##### CTMLE Applied to the Birth Data

We now apply all these variants of CTMLE to the case study. The final logistic preordering CTMLE is based on a PS model containing nine covariates, including variables indicating the year of birth and variables capturing education and socioeconomic status. The CTMLE based on correlation preordering selected three covariates for the PS, one variable that captures participation in a social assistance program, but also two variables that measure the availability of healthcare providers in the community. These latter variables are indeed expected to have a strong association with the outcome, assisted birth, hence it is not surprising that they were selected, based on their role in reducing the MSE of the ATE. Finally, the lasso CTMLE is based on a penalty term of 0.000098, chosen by cross-validation, which results in all variables having nonzero coefficients for the PS.

The results are reported on Table 6. The estimated ATEs are somewhat larger than those obtained using TMLE, all reporting an increase of around 8% in the probability of giving birth assisted by a healthcare professional, for those who have social health insurance vs. uninsured. However, the CTMLE estimate resulting from the lasso, which selected all variables into the PS model, is (unsurprisingly) very similar to the TMLE estimate reported in “Double-Robust Estimation With Machine Learning,” and thus is further away from the naive estimate than the estimates from CTMLEs that use only a subset of the variables. Lasso–CTMLE also has wider 95% CIs than the rest, while the greedy CTMLE has the tightest. This may be empirical evidence of the bias–variance trade-off: due to using fewer covariates in the PS, the estimators that use aggressive variable selection for the PS are slightly biased, but with lower variance.

#### Table 6: ATE Obtained With CTML Estimators Applied to the Birth Data

ATE

95% CI

Greedy CTMLE

0.082

(0.071, 0.094)

Scalable CTMLE logistic preordering

0.085

(0.067, 0.104)

Scalable CTMLE correlation preordering

0.082

(0.068, 0.097)

Lasso CTMLE

0.076

(0.048, 0.105)

### Further Topics

Throughout this chapter, we have seen how ML can be used in estimating ATE, by using it as a prediction tool for outcome regression or PS models. The same logic can be applied to many other estimation problems where there are nuisance parameters that need to be predicted as part of the estimation of the parameter of interest.

#### Estimating Treatment Effect Heterogeneity

We focused the discussion to the common target parameter, the ATE. Most of the methods considered are also available for the ATT parameter (see, e.g., Chernozhukov et al., 2018). The difference between the ATE and ATT stems from heterogeneous treatment effects, and this heterogeneity—in particular, heterogeneity with respect to observed covariates—in itself can be an interesting target of causal inference. For example, in looking at the birth cohort example, we may be interested in how the effect of health insurance varies by the socioeconomic gradient. To answer this question, one may either want to specify the “treatment effect function,” a possibly nonparametric function of the treatment effects as a function of deprivation, and possibly other covariates, such as age. Another approach may be subgroup analysis, based on variables that have been selected in a data-adaptive way.

Imai and Ratkovic (2013) propose a variable selection method using Support Vector Machines, to estimate heterogeneous treatment effects in randomized trials. Hahn et al. (2017) further develop the BART framework to estimate treatment effect heterogeneity, by flexibly using the estimated PS to correct for regularization bias. Athey and Imbens (2016) propose a regression tree method, referred to as “causal trees,” to identify subgroups with treatment effect heterogeneity, using sample-splitting to avoid overfitting. Wager and Athey (2018) extend this idea to random forest-based algorithms, referred to as “causal forests,” for which they establish theoretical properties. A second interesting question may concern optimal policies: if a health policy-maker has limited resources to provide free insurance for those currently uninsured, what would be an optimal allocation mechanism that maximizes health benefits? The literature on “optimal policy learning” is rapidly growing. Kitagawa and Tetenov (2017) focus on estimating optimal policies from a set of possible policies with limited complexity, while Athey and Wager (2017) further develop the DML (Chernozhukov et al., 2018) to estimate optimal policies. Further approaches have been proposed, for example (Kallus, 2017) based on balancing and within the TMLE framework (Luedtke & Chambaz, 2017; van der Laan & Luedtke, 2015).

#### Instrumental Variables

In certain situations, even after adjusting for observed covariates, there may be doubts as to whether conditional exchangeability holds. However, other methods can be used where there is an instrumental variable (IV) available, that is, a variable that is correlated with the exposure but is not associated with any confounder of the exposure–outcome association, nor is there any pathway by which the IV affects the outcome, other than through the exposure. Depending on the additional assumptions the analyst is prepared to make, different estimands can be identified. Here, we focus on the local average treatment effect (LATE) under monotonic treatment (Angrist, Imbens, & Rubin, 1996).

Consider the (partially) linear instrumental variable model, which in its simpler form can be thought of as a two-stage procedure, where the first stage consists of a linear regression of the endogenous exposure A on the instrument Z, $A=α0+αZ+εa$. Then in a second stage, we regress the outcome on the predicted exposure $A^$, $Y=β0+β1A^+εy$.

Usually, the first stage is treated as an estimation step, and the coefficients are obtained using OLS. In fact, we are only interested in the predicted exposure for each observation, and the parameters in the first stage are merely nuisance parameters that must be estimated to calculate the fitted values for exposure. Thus, we can think of this problem directly as a prediction problem and use ML algorithms for the first stage. This can help alleviate some of the finite sample bias, often observed in IV estimates, which are typically biased toward the OLS, as a consequence of overfitting the first-stage regression, a problem that is more serious with small sample sizes or weak instruments (Mullainathan & Spiess, 2017).

A number of recent studies have used ML for the first stage of the IV models. Belloni, Chen, Chernozhukov, and Hansen (2012) use lasso, while Hansen and Kozbur (2014) use ridge regression. More recently, a TMLE has been developed for IV models, which uses ML fits also for the initial target parameter estimation (Tóth & van der Laan, 2016). Double-robust IV estimators can also be used with ML predictions for the nuisance models in the second stage, as shown in Chernozhukov et al. (2018) and DiazOrdaz, Daniel, and Kreif (2018).

### Discussion

This article provides an overview of the current use of ML methods for causal inference, in the setting of evaluating the average treatment effects of binary static treatments, assuming no unobserved confounding. We used a case study of an evaluation of a health insurance scheme on healthcare utilization in Indonesia. The case study displayed characteristics typical of applied evaluations: a binary outcome and a mixture of binary, categorical, and continuous covariates. A practical limitation of the presented case study is the presence of missing data. For simplicity, we used a complete case analysis, assuming that missingness does not depend on both the treatment and the outcome (Bartlett et al., 2015). If this is not the case, the resulting estimates may be biased. Several options to handle missing data under less restrictive assumptions exist: for example, multiple imputation (Rubin, 1987), which is in general valid under missing-at-random assumptions, or the missing indicator method, which includes indicators for missingness in the set of potential variables to adjust for, and relies on the assumption that the confounders are only such when observed (D’Agostino, Lang, Walkup, Morgan, & Karter, 2001). Another alternative is to use inverse probability of “being a complete case” weights, which can be easily combined with many of the methods described in this article (Seaman & White, 2014).

We highlight the limitations of naively interpreting the output of ML prediction methods as causal estimates and provide a review of recent innovations that plug in ML prediction of nuisance parameters in ATE estimators. We demonstrate how ML can make the estimation of the PS more principled and also illustrate a multivariate matching approach that uses ML to data-adaptively select balanced comparison groups. We also highlight the limitations of such “design-based” approaches: they may not improve balance on variables that really matter to reduce bias in the estimated ATE, as they cannot easily take into account information on the relative importance of confounders for the outcome variable.

We give a brief overview of the possibility of using ML for estimating ATEs via outcome regressions. We emphasize that obtaining valid confidence intervals after such procedures is complicated, and the bootstrap is not valid. Some methods, such as BARTs, are able to give inferences based on the corresponding posterior distributions and have been used in practice with success (Dorie et al., 2017). Nevertheless, there are currently no theoretical results underpinning its use (Wager & Athey, 2018), and thus BART inferences should be used with caution. Instead, we illustrate DR approaches that combine the strengths of PS estimation and outcome modeling, and are able to incorporate ML predictors in a principled way. These approaches, specifically TMLE pioneered by van der Laan and colleagues and the DML estimators developed by Chernozokov and colleagues, have appealing theoretical properties and increasing evidence of their good finite sample performance (Dorie et al., 2017; Porter et al., 2011). All estimation approaches demonstrated in this article rely on the assumption that selection into treatment is based on observable covariates only. In many settings of policy evaluations, this assumption is not tenable. Under such settings, beyond IV methods, panel data approaches are commonly used to control for one source of unobserved confounding, that is, due to unobservables that remain constant over time. To date, ML approaches have not been combined with panel data econometric methods. Exceptions are Bajari, Nekipelov, Ryan, and Yang (2015) and Chernozhukov et al. (2017) who demonstrate ML approaches for demand estimation using panel data.

We stress once again that ML methods can improve the estimation of causal effects only once the identification step has been firmed up and using estimators with appropriate convergence rates, so that they remain consistent even when using ML fits. However, with the increasing availability of big data, in particular in settings with a very large number of covariates, assumptions such as “no unobserved confounders” may be more plausible (Titiunik, 2015). With such d » n data sets, ML methods are indispensable for variable selection as well as the construction of low dimensional parameters, such as ATEs. Indeed, many innovations in ML for causal inference are taking place in such d » n settings (e.g., Belloni & Chernozhukov, 2011; Belloni et al., 2014; Wager & Athey, 2018).

Finally, we believe that, paradoxically, ML methods, which are often criticized for their “black box” nature, may increase the transparency of applied research. In particular, ensemble learning algorithms, such as SL, can provide a safeguard against having to hand-pick the best model or algorithm. ML should be encouraged to enhance expert substantive knowledge when selecting confounders and model specification.

### Acknowledgments

Karla DiazOrdaz was supported by the U.K. Wellcome Trust Institutional Strategic Support Fund–LSHTM Fellowship 204928/Z/16/Z.

Noémi Kreif gratefully acknowledges her co-authors on the impact evaluation of the Indonesian public health insurance scheme: Andrew Mirelman, Rodrigo Moreno-Serra, Marc Suhrcke (Centre for Health Economics, University of York), and Budi Hidayat (University of Indonesia).