Conditional independence (CI) testing is widely used in causal discovery, scientific modeling, fairness, domain generalization, and robustness analysis. And yet it often fails in practice.
Why?
Let’s unpack the story.
Before we talk about hardness, we need to understand what conditional independence really means — and how we try to measure it.
At a high level, conditional independence between \(A\) and \(B\) given \(C\) asks:
Once we account for \(C\), does \(B\) still tell us anything about \(A\)?
Here are two concrete examples that help build intuition.
Race and loan decisions may be correlated marginally. But once we control for creditworthiness, race should no longer influence the decision. If it still does, the system may be unfair.
A model might use “time of day” as a shortcut for predicting location. If so, its predictions will fail when time distributions shift. If predictions remain independent of time once we condition on true location, the model is robust.
Conditional independence is defined as
\[A \perp\!\!\!\perp B \mid C \quad \Longleftrightarrow \quad P_{A,B \mid C} = P_{A \mid C} P_{B \mid C}.\]Once \(C\) is fixed, knowing \(B\) provides no additional information about \(A\).
Equivalently, for almost every value of \(C\),
\[\text{Cov}(f(A), g(B) \mid C) = 0\]for all measurable functions \(f\) and \(g\).
In other words, conditional independence requires checking that no dependence remains between \(A\) and \(B\) at almost every value of \(C\).
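To make this concrete, here is a minimal numpy sketch mirroring the creditworthiness example; the linear Gaussian data-generating process is an illustrative assumption. \(A\) and \(B\) are marginally correlated through a common cause \(C\), but regressing out \(C\) leaves independent residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# C is a common cause of A and B (like creditworthiness above).
C = rng.normal(size=n)
A = C + rng.normal(size=n)
B = C + rng.normal(size=n)

# Marginally, A and B are strongly correlated...
marginal = np.corrcoef(A, B)[0, 1]

# ...but after regressing out C (the true conditional means are linear
# here, so OLS residuals suffice), nothing remains.
RA = A - np.polyval(np.polyfit(C, A, 1), C)
RB = B - np.polyval(np.polyfit(C, B, 1), C)
partial = np.corrcoef(RA, RB)[0, 1]

print(f"marginal corr: {marginal:.3f}")   # ≈ 0.5
print(f"partial corr:  {partial:.3f}")    # ≈ 0.0
```

In this linear setting a single regression suffices; the rest of the post is about what happens when the conditional structure is not this simple.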
In hypothesis testing, we consider two hypotheses: a null \(H_0\) and an alternative \(H_1\).
In CI testing, these become \(H_0: A \perp\!\!\!\perp B \mid C\) versus \(H_1: A \not\perp\!\!\!\perp B \mid C\).
By default, we retain \(H_0\) and reject it only when the observed test statistic would be unlikely under the null.
In testing, two types of error can occur:
Type I error (false positive):
Rejecting \(H_0\) when conditional independence actually holds.
In CI testing, this means concluding that \(A\) and \(B\) are conditionally dependent when they are not.
Type II error (false negative):
Failing to reject \(H_0\) when conditional dependence actually exists.
In CI testing, this means missing real conditional structure.
A good test aims to control Type I error at a chosen level \(\alpha\) while keeping Type II error small, i.e., achieving high power.
For many classical testing problems, such as unconditional independence testing, this tradeoff is manageable. For conditional independence, however, controlling both errors simultaneously is much more difficult. We will see why in detail later.
A characterization of conditional independence is: \(A \perp\!\!\!\perp B \mid C\) if and only if, for all square-integrable functions \(f \in L_A^2\), \(g \in L_B^2\), \(w \in L_C^2\),
\[\mathbb{E}_{C}\Big[ w(C) \, \mathbb{E}_{AB|C}\Big[ \big(f(A)- \mathbb{E}[f(A)|C]\big) \big(g(B)- \mathbb{E}[g(B)|C]\big) \Big] \Big] = 0.\]Intuition behind this CI measure:
If any conditional covariance remains on a region of \(C\) with non-negligible probability, an appropriate \(w(C)\) will detect it.
In principle, the condition involves all square-integrable functions, which is impossible to handle computationally.
Instead, we restrict the function class to a Reproducing Kernel Hilbert Space (RKHS), which provides a tractable and flexible approximation.
An RKHS \(\mathcal{H}_A\) contains functions that are linear in a feature map \(\phi_A\):
\[f(a) = \langle w, \phi_A(a) \rangle,\]where \(\phi_A\) is a feature map.
Define the conditional mean embedding
\[\mu_{A|C}(c) = \mathbb{E}[ \phi_A(A) \mid C = c ].\]It satisfies
\[\langle \mu_{A|C}(c), f \rangle_{\mathcal{H}_A} = \mathbb{E}[ f(A) \mid C = c ].\]So conditional expectations become inner products in Hilbert space.
We define the conditional cross-covariance operator
\[\mathcal{C}_{AB|C}(c) = \mathbb{E}_{AB|C}\Big[ \big(\phi_A(A)-\mu_{A|C}(c)\big) \otimes \big(\phi_B(B)-\mu_{B|C}(c)\big) \mid C=c \Big].\]This operator satisfies
\[\langle f \otimes g, \mathcal{C}_{AB|C}(c) \rangle = \text{Cov}(f(A), g(B) \mid C=c).\]So it encodes all conditional covariances.
To aggregate over \(C\), we define
\[\mathcal{C}_{\text{KCI}} = \mathbb{E}_C\Big[ \mathcal{C}_{AB|C}(C) \otimes \phi_C(C) \Big].\]For any test functions \(f, g, w\),
\[\langle f \otimes g, \mathcal{C}_{\text{KCI}}\, w \rangle = \mathbb{E}_C\Big[ w(C) \, \mathbb{E}_{AB|C} \Big[ (f(A)-\mathbb{E}[f(A)|C]) (g(B)-\mathbb{E}[g(B)|C]) \Big] \Big].\]If the RKHSs \(\mathcal{H}_A, \mathcal{H}_B, \mathcal{H}_C\) are sufficiently rich (e.g., \(L^2\)-universal), then
\[\mathcal{C}_{\text{KCI}} = 0 \quad \Longleftrightarrow \quad A \perp\!\!\!\perp B \mid C.\]A common test statistic is the squared Hilbert–Schmidt norm of an empirical estimate of this operator, \(\|\hat{\mathcal{C}}_{\text{KCI}}\|_{\mathrm{HS}}^2\).
So the problem reduces to estimating this operator from finite samples and determining whether it is zero.
We first explain why conditional independence testing is theoretically hard — regardless of which test or CI measure is used.
Start with any distribution of scalars \(A, B, C\) such that
\[A \not\!\perp\!\!\!\perp B \mid C.\]So conditional dependence genuinely exists.
Now perform the following transformation: append to \(C\) arbitrarily small digits that encode the information linking \(A\) and \(B\), yielding new variables \((A_\text{new}, B_\text{new}, C_\text{new})\).
After this construction, the enriched conditioning variable absorbs the dependence, while the joint distribution is nearly unchanged.
As a result,
\[A_\text{new} \perp\!\!\!\perp B_\text{new} \mid C_\text{new}.\]The conditional dependence has disappeared.
The insight here is:
Given any continuous distribution of \(A, B, C\), we can hide the dependence inside arbitrarily fine structure and construct a conditionally independent distribution that looks almost indistinguishable from the dependent one.
To detect this transformation, a test would need effectively infinite precision. Thus, no finite-sample test can reliably do this.
This construction is not just a clever trick. It reflects a deep structural limitation formalized by Shah and Peters (2020):
For any finite-sample conditional independence test, and for any alternative distribution where \(A \not\!\perp\!\!\!\perp B \mid C\), there exists a null distribution (where \(A \perp\!\!\!\perp B \mid C\)) that the test cannot reliably distinguish from that alternative.
More concretely: no CI test can be uniformly valid over all continuous distributions. If a test controls Type I error at level \(\alpha\) for every null distribution, then its power cannot exceed \(\alpha\) against any alternative.
And this is not a shortcoming of some algorithm but rather a fundamental limitation of the problem itself.
You might think: these are adversarial constructions, surely in practice we don’t encounter them.
True, we rarely face carefully engineered binary embedding tricks.
But… the mechanism behind the impossibility result is not artificial.
The core issue is:
Conditional dependence can live in fine-grained structure,
and detecting (in)dependence from finite samples is very delicate.
To see how this arises in a realistic setting, consider a problem inspired by engineering diagnostics:
Suppose we have high-dimensional vibration data \(C\) collected from a mechanical system.
We want to know whether the behavior of component \(A\) is connected to that of component \(B\), after conditioning on the vibration signal.
In many systems, the vibration signal drives both components through smooth global trends, while any residual coupling between \(A\) and \(B\) appears only locally, at fine scales of the signal.
Detecting these two phenomena requires very different scales of analysis.
To formalize this, consider the model
\[A = f_A(C) + r_A\] \[B = f_B(C) + r_B\]where
\[(r_A, r_B) \mid C \sim \mathcal{N} \left( 0, \begin{pmatrix} 1 & \gamma(C) \\ \gamma(C) & 1 \end{pmatrix} \right).\]Here, \(r_A\) and \(r_B\) are residual noise terms, and \(\gamma(C)\) is their conditional correlation.
Now define the two hypotheses: under \(H_0\), \(\gamma(C) = 0\); under \(H_1\), \(\gamma(C) = \sin(C)\).
So under the alternative, the residual correlation oscillates smoothly as a function of \(C\).
Why is this subtle?
The marginal residual correlation is
\[\mathbb{E}[\gamma(C)] = \mathbb{E}[\sin(C)].\]Since \(C \sim \mathcal{N}(0,1)\) and \(\sin(\cdot)\) is symmetric,
\[\mathbb{E}[\sin(C)] = 0.\]So globally, the residual correlation averages out.
Marginally, there is no detectable correlation.
The dependence only appears locally in regions of \(C\).
This is the key difficulty:
Conditional dependence may oscillate, cancel globally, and only be visible at the right scale.
Methods that only measure averaged conditional dependence can completely miss structured, localized interactions. For instance, the Generalized Covariance Measure (GCM), which effectively averages conditional covariance over \(C\), can fail when dependence oscillates and cancels globally.
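A quick numpy simulation of this cancellation, using a residual construction consistent with the model above (the local window over \(C\) is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Conditional residual correlation gamma(C) = sin(C), as under H1 above.
C = rng.normal(size=n)
gamma = np.sin(C)
rA = rng.normal(size=n)
rB = gamma * rA + np.sqrt(1 - gamma**2) * rng.normal(size=n)

# Globally the correlation cancels, since E[sin(C)] = 0 for C ~ N(0, 1)...
global_corr = np.corrcoef(rA, rB)[0, 1]

# ...but locally, where sin(C) > 0, the dependence is plainly visible.
mask = (C > 0.5) & (C < 2.5)
local_corr = np.corrcoef(rA[mask], rB[mask])[0, 1]

print(f"global corr: {global_corr:.3f}")   # ≈ 0
print(f"local corr:  {local_corr:.3f}")    # clearly positive
```

A statistic that only looks at the global residual correlation sees essentially zero here, while any method that resolves \(C\) at the right scale sees strong dependence.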
The tension discussed above already makes detecting subtle dependence fragile.
But there is an even deeper difficulty.
Even if we choose the right scale to detect dependence, CI testing can still fail — because of how we estimate conditional means.
If we look at the wrong structure, we may neglect real dependence and fail to detect it.
If the conditional means are estimated inaccurately, we may introduce artificial residual correlation and falsely conclude that dependence exists.
To understand where things go wrong in practice, let’s first look at the idealized setting — where we know the conditional means exactly.
When linear kernels are used for \(A\) and \(B\), i.e., \(\phi_A(A)=A\) and \(\phi_B(B)=B\), the Kernel-based Conditional Independence (KCI) test can be understood in three conceptual steps: (1) regress \(A\) and \(B\) on \(C\) to form residuals \(R_A = A - \mathbb{E}[A \mid C]\) and \(R_B = B - \mathbb{E}[B \mid C]\); (2) measure the covariance remaining between the residuals at each value of \(C\); (3) aggregate this conditional covariance over \(C\) using the kernel \(k_C\).
Intuitively, KCI asks: after removing everything that can be explained by \(C\), is there any remaining dependence between \(A\) and \(B\)?
In this ideal infinite-sample regime, where the conditional means are known perfectly:
Under \(H_0\), \(\text{Cov}(R_A, R_B \mid C) = 0\) almost surely, so the KCI statistic converges to zero.
Under \(H_1\), residual dependence remains for some values of \(C\), and the kernel aggregation detects it.
In this idealized setting, both error types are controllable: the statistic is exactly zero under \(H_0\) and strictly positive under \(H_1\).
In other words, if the conditional means were known exactly, CI testing would be a well-behaved problem.
The real trouble begins when we have to estimate those conditional means from data.
Everything above assumed we knew the true conditional means.
In practice, we do not.
We estimate them from data, typically using kernel ridge regression.
Write the estimators as
\[\hat{\mu}_{A|C}(c) = \mu_{A|C}(c) + \delta_{A|C}(c),\] \[\hat{\mu}_{B|C}(c) = \mu_{B|C}(c) + \delta_{B|C}(c),\]where \(\delta_{A\mid C}\) and \(\delta_{B\mid C}\) are the regression errors.
What happens to the residuals?
The empirical residuals are now
\[\hat{R}_A = A - \hat{\mu}_{A|C}(C), \qquad \hat{R}_B = B - \hat{\mu}_{B|C}(C).\]Substituting,
\[\hat{R}_A = (A - \mu_{A|C}(C)) - \delta_{A|C}(C),\] \[\hat{R}_B = (B - \mu_{B|C}(C)) - \delta_{B|C}(C).\]Under our generative model,
\[A - \mu_{A|C}(C) = r_A, \qquad B - \mu_{B|C}(C) = r_B.\]So
\[\hat{R}_A = r_A - \delta_{A|C}(C), \qquad \hat{R}_B = r_B - \delta_{B|C}(C).\]Now multiply:
\[\hat{R}_A \hat{R}_B = r_A r_B - r_A \delta_{B|C}(C) - r_B \delta_{A|C}(C) + \delta_{A|C}(C)\delta_{B|C}(C).\]Take the conditional expectation given \(C\) (assuming the regressions are trained on independent data, so the errors \(\delta\) are fixed functions with respect to the test sample). The cross terms vanish because \(\mathbb{E}[r_A \mid C] = \mathbb{E}[r_B \mid C] = 0\), giving
\[\mathbb{E}[\hat{R}_A \hat{R}_B \mid C] = \mathbb{E}[r_A r_B \mid C] + \delta_{A|C}(C)\delta_{B|C}(C).\]But
\[\mathbb{E}[r_A r_B \mid C] = \gamma(C).\]So we obtain
\[\mathbb{E}[\hat{R}_A \hat{R}_B \mid C] = \gamma(C) + \delta_{A|C}(C)\delta_{B|C}(C).\]Under \(H_0\), \(\gamma(C) = 0.\) So the population residual covariance becomes
\[\mathbb{E}[\hat{R}_A \hat{R}_B \mid C] = \delta_{A\mid C}(C)\delta_{B\mid C}(C),\]even though \(A \perp\!\!\!\perp B \mid C\) holds in truth.
When KCI aggregates over \(C\) using the kernel,
\[\text{KCI} = \mathbb{E} \left[ k_C(C,C') \, \delta_{A\mid C}(C) \delta_{A\mid C}(C') \delta_{B\mid C}(C) \delta_{B\mid C}(C') \right].\]So the regression errors induce a nonzero population statistic, which leads to inflated Type I error.
This is not sampling noise.
It is structural bias introduced by imperfect regression.
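A small simulation makes the bias concrete. Here the data satisfy \(H_0\), and both conditional means are known up to a shared, smooth error; the error shape \(\delta(c) = 0.3\cos(2c)\) is a hypothetical choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

# Null model: A and B are conditionally independent given C.
C = rng.normal(size=n)
rA = rng.normal(size=n)
rB = rng.normal(size=n)
A = np.sin(2 * C) + rA   # mu_{A|C}(c) = sin(2c)
B = np.sin(2 * C) + rB   # mu_{B|C}(c) = sin(2c)

# Hypothetical structured regression error shared by both regressions.
delta = 0.3 * np.cos(2 * C)
RA_hat = A - (np.sin(2 * C) + delta)   # = rA - delta
RB_hat = B - (np.sin(2 * C) + delta)   # = rB - delta

# With the true means the residual covariance vanishes; with the biased
# means it concentrates around E[delta^2] = 0.09 * E[cos^2(2C)] ≈ 0.045.
true_cov = np.mean(rA * rB)
biased_cov = np.mean(RA_hat * RB_hat)

print(f"residual cov, true means:   {true_cov:.4f}")   # ≈ 0
print(f"residual cov, biased means: {biased_cov:.4f}") # ≈ 0.045
```

The spurious covariance does not shrink as \(n\) grows; only better regression (smaller \(\delta\)) removes it.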
Under ideal conditions (perfect regression), the KCI statistic behaves like a degenerate U-statistic under the null.
Because the population statistic is exactly zero under \(H_0\), the first-order term of the U-statistic vanishes. This degeneracy makes the statistic itself shrink at rate \(1/n\).
Null approximations — whether chi-square mixtures, Gamma approximations, or wild bootstrap — rely critically on this structure.
They assume the statistic is centered at zero under the null, with fluctuations of order \(1/n\) driven by the degenerate structure.
When the conditional means are estimated imperfectly, this assumption breaks down.
Under \(H_0\), the statistic no longer has zero population mean: once the mean is nonzero, the U-statistic is no longer degenerate, and its leading term behaves like an empirical average of nonzero quantities.
Formally, instead of shrinking toward zero, the statistic concentrates around a positive constant:
\[\text{KCI}_n = \text{Bias} + O_p(1/\sqrt{n}).\]Unless the regression error itself shrinks sufficiently fast, the bias remains.
| Case | Mean of KCI | Std Dev |
|---|---|---|
| Perfect regression | 0 | \(O(1/n)\) |
| Imperfect regression | \(O(1)\) | \(O(1/\sqrt{n})\) |
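The table can be reproduced qualitatively with a toy proxy statistic, the squared mean residual product; this is a stand-in for KCI's degenerate behavior, not the actual KCI statistic, and the bias shape is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(4)

def toy_stat(n, bias=0.0, reps=100):
    """Mean of a toy degenerate statistic: squared average residual product.

    Under H0 with perfect regression this shrinks like 1/n; with a
    structured regression error it plateaus near the squared bias.
    """
    vals = []
    for _ in range(reps):
        delta = bias * np.cos(rng.normal(size=n))  # hypothetical error
        RA = rng.normal(size=n) - delta
        RB = rng.normal(size=n) - delta
        vals.append(np.mean(RA * RB) ** 2)
    return np.mean(vals)

for n in (500, 5000, 50_000):
    print(f"n={n:6d}  perfect={toy_stat(n):.2e}  "
          f"imperfect={toy_stat(n, bias=0.5):.2e}")
```

The "perfect" column decays like \(1/n\), while the "imperfect" column stays at a roughly constant positive level, which is exactly the shift that breaks null calibration.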
The consequence:
Null calibration procedures still assume a centered statistic.
But the true distribution is shifted.
As the test sample size \(n\) grows, the bias term stays fixed while the null thresholds, computed for a centered statistic, shrink toward zero.
Eventually, the statistic will almost surely exceed any fixed null threshold.
Type I error thus inflates with increasing \(n\).
This is why regression error is not a small nuisance.
It fundamentally changes the asymptotic regime.
In principle, we choose the kernel (especially the bandwidth on \(C\)) to maximize power — that is, to better detect conditional dependence. However,
The same kernel choice that amplifies true dependence can also amplify regression-induced bias.
Recall that under imperfect regression, the null statistic contains the term
\[\delta_{A\mid C}(C)\,\delta_{B\mid C}(C).\]These regression errors are not arbitrary noise. They are smooth, structured functions of \(C\).
When we optimize the kernel bandwidth to increase sensitivity to dependence, we are effectively choosing a weighting function over \(C\).
If that weighting aligns with regions where \(\delta_{A\mid C}(C)\,\delta_{B\mid C}(C)\) is large, the test statistic increases — even under the null.
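A small numpy sketch of this effect under the null, where both regressions share a hypothetical structured error \(\delta(c) = 0.3\cos(c)\) and the candidate weight functions are illustrative: the more the weighting over \(C\) aligns with the region where \(\delta_{A\mid C}\delta_{B\mid C}\) is large, the larger the null statistic becomes.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

# Null data (A ⊥ B | C) with hypothetical shared regression error.
C = rng.normal(size=n)
delta = 0.3 * np.cos(C)
RA = rng.normal(size=n) - delta   # estimated residuals under H0
RB = rng.normal(size=n) - delta

def weighted_cov(w):
    """Normalized weighted residual covariance E[w(C) RA RB] / E[w(C)]."""
    return np.mean(w * RA * RB) / np.mean(w)

flat = weighted_cov(np.ones(n))
aligned = weighted_cov(np.cos(C) ** 2)     # large where delta^2 is large
misaligned = weighted_cov(np.sin(C) ** 2)  # large where delta^2 is small

print(f"misaligned: {misaligned:.4f}")
print(f"flat:       {flat:.4f}")
print(f"aligned:    {aligned:.4f}")
```

A power-maximizing search over weightings (or bandwidths) performed on the same data would drift toward the "aligned" choice, inflating the null statistic.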
In other words, tuning the kernel for power also tunes its sensitivity to regression artifacts.
This creates a fundamental tradeoff: the same kernel choices that increase sensitivity to true conditional dependence also increase sensitivity to spurious, regression-induced dependence.
Optimizing for power can therefore push us directly into spurious rejection.
Conditional independence testing is fragile not only because dependence may be subtle,
but because the very act of searching for it can create it.
CI testing does not merely require detecting dependence. It also requires estimating conditional means accurately enough.
That is a very strong requirement.
And that is why CI testing fails in practice.
Conditional independence testing is difficult for structural reasons:
Dependence can hide in subtle structure.
Conditional dependence may be localized, oscillatory, or globally canceling — detectable only at the right scale.
Regression error induces spurious dependence.
Imperfect estimation of \(\mathbb{E}[\phi_A(A) \mid C]\) and \(\mathbb{E}[\phi_B(B) \mid C]\) introduces artificial residual correlation, even under the null.
Detection is unstable in both directions.
Conditional dependence is not only hard to detect (high Type II error), it is also easy to hallucinate when conditional means are estimated inaccurately (inflated Type I error).
Moreover, these phenomena are not specific to KCI. They arise in essentially all conditional dependence measures that rely on estimating conditional expectations and aggregating residual covariances across values of \(C\).
Though the theory is pessimistic, CI testing can still be useful in practice — if we treat it as a fragile procedure and design the pipeline accordingly.
Sample splitting.
Use an independent training set to estimate the conditional means \(\hat{\mu}_{A|C}\) and \(\hat{\mu}_{B|C}\), and a separate test set to compute the KCI statistic and calibrate the null.
As a rule of thumb, the training set should be at least as large as the test set, and preferably much larger (how much larger depends on the complexity of \(\mathbb{E}[A|C]\) and \(\mathbb{E}[B|C]\)).
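A minimal sketch of such a split pipeline. The kernel ridge estimator, bandwidth, regularization, and permutation-based calibration are all illustrative choices, and the statistic is a simplified residual-covariance proxy rather than the full KCI statistic.

```python
import numpy as np

rng = np.random.default_rng(5)

def rbf(X, Y, bw=1.0):
    return np.exp(-(X[:, None] - Y[None, :]) ** 2 / (2 * bw**2))

def fit_krr(C_tr, Y_tr, lam=1e-3, bw=1.0):
    """Kernel ridge regression of Y on C; returns a predictor function."""
    m = len(C_tr)
    alpha = np.linalg.solve(rbf(C_tr, C_tr, bw) + m * lam * np.eye(m), Y_tr)
    return lambda c: rbf(c, C_tr, bw) @ alpha

# Toy data satisfying the null: A and B depend on C but not on each other.
n = 1200
C = rng.normal(size=n)
A = C**2 + rng.normal(size=n)
B = np.sin(C) + rng.normal(size=n)

# Sample splitting: estimate conditional means on the training half only.
tr, te = np.arange(600), np.arange(600, n)
mu_A = fit_krr(C[tr], A[tr])
mu_B = fit_krr(C[tr], B[tr])

# Compute residuals and a residual-covariance statistic on the test half,
# calibrated by permuting one set of residuals.
RA = A[te] - mu_A(C[te])
RB = B[te] - mu_B(C[te])
stat = np.mean(RA * RB) ** 2
null = [np.mean(rng.permutation(RA) * RB) ** 2 for _ in range(500)]
pval = np.mean([s >= stat for s in null])
print(f"p-value: {pval:.3f}")
```

Note that the permutation calibration here only randomizes sampling noise; as discussed above, it cannot remove the structural bias \(\delta_{A\mid C}\delta_{B\mid C}\) if the regressions are systematically off.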
Strong regression (bias matters more than you think).
Use flexible, low-bias regression models for conditional mean estimation. CI testing is extremely sensitive to systematic regression error: even small structured bias can look like conditional dependence under \(H_0\).
Kernel choice for power (but only on the training split).
If you tune the conditioning kernel \(k_C\) (e.g., its bandwidth) to improve power, do it using only the training split—e.g., by maximizing an estimated signal-to-noise ratio (SNR). Then freeze the choice and evaluate on the test split.
Be cautious about “discoveries.”
Even with all safeguards, it is still easy to trick yourself: power tuning can overfit to regression artifacts, and null calibration can silently fail when regression error does not decay fast enough. Treat borderline rejections with skepticism and stress-test them with sensitivity checks.
If you find this blog helpful, please consider citing:
@inproceedings{hehardness2025,
title={On the Hardness of Conditional Independence Testing In Practice},
author={He, Zheng and Pogodin, Roman and Li, Yazhe and Deka, Namrata and Gretton, Arthur and Sutherland, Danica J},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025}
}
Paper: On the Hardness of Conditional Independence Testing In Practice