Why is testing conditional independence so hard?

Conditional independence (CI) testing is widely used in causal discovery, scientific modeling, fairness, domain generalization, and robustness analysis. And yet it often fails in practice.

Why?

Let’s unpack the story.

Background

Before we talk about hardness, we need to understand what conditional independence really means — and how we try to measure it.

Conditional independence

What does conditional independence between \(A\) and \(B\) given \(C\) mean?

At a high level:

Once we account for \(C\), does \(B\) still tell us anything about \(A\)?

Two concrete examples help build intuition.

Example 1: fairness in lending

Race and loan decisions may be correlated marginally.
But once we control for creditworthiness, race should no longer influence the decision.

If it still does, the system may be unfair.

Example 2: distribution shift

A model might use “time of day” as a shortcut for predicting location.
If so, its predictions will fail when time distributions shift.

If predictions remain independent of time once we condition on true location, the model is robust.

Formal definition

Conditional independence is defined as

\[A \perp\!\!\!\perp B \mid C \quad \Longleftrightarrow \quad P_{A,B \mid C} = P_{A \mid C} P_{B \mid C}.\]

Once \(C\) is fixed, knowing \(B\) provides no additional information about \(A\).

Equivalently, for almost every value of \(C\),

\[\text{Cov}(f(A), g(B) \mid C) = 0\]

for all measurable functions \(f\) and \(g\).

This functional view turns out to be crucial.
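
To make the functional view concrete, here is a minimal NumPy sketch (the data-generating model and the test functions are illustrative choices, not taken from the paper): \(A\) and \(B\) are marginally dependent because both are driven by \(C\), yet the conditional covariance, approximated by slicing \(C\) into narrow bins, is essentially zero.

```python
# Minimal sketch of the functional view: A and B are marginally dependent
# (both driven by C) but conditionally independent given C. The model and
# the test functions (identity and tanh) are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
C = rng.normal(size=n)
A = C + rng.normal(size=n)     # depends on C plus private noise
B = C + rng.normal(size=n)     # depends on C plus private noise => A ⊥⊥ B | C

fA, gB = A, np.tanh(B)         # f(A) = A, g(B) = tanh(B)
print("marginal cov(f(A), g(B)):", np.cov(fA, gB)[0, 1])

# Approximate Cov(f(A), g(B) | C) by slicing C into 50 narrow quantile bins.
edges = np.quantile(C, np.linspace(0, 1, 51))
which = np.digitize(C, edges[1:-1])
cond_cov = [np.cov(fA[which == k], gB[which == k])[0, 1] for k in range(50)]
print("max |cov(f(A), g(B) | C in bin)|:", max(abs(c) for c in cond_cov))
```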

CI as a hypothesis testing problem

At its core, hypothesis testing asks: given the data, is there enough evidence to reject a default assumption (the null hypothesis) in favor of an alternative?

In CI testing:

\[H_0: A \perp\!\!\!\perp B \mid C\] \[H_1: A \not\!\perp\!\!\!\perp B \mid C\]

We compute a statistic from data. If it is too extreme, we reject \(H_0\).

Two types of error can occur: a Type I error (rejecting \(H_0\) when it is true) and a Type II error (failing to reject \(H_0\) when it is false).

A good test aims to keep the Type I error below a prescribed level \(\alpha\) while keeping the Type II error as small as possible, i.e., maximizing power.

For many classical testing problems, this tradeoff is manageable.

For conditional independence, both errors are unusually difficult to control simultaneously.

Measuring conditional independence

A characterization of conditional independence is: \(A \perp\!\!\!\perp B \mid C\) if and only if, for all square-integrable functions \(f \in L_A^2\), \(g \in L_B^2\), \(w \in L_C^2\),

\[\mathbb{E}_{C}\Big[ w(C) \, \mathbb{E}_{AB|C}\Big[ \big(f(A)- \mathbb{E}[f(A)|C]\big) \big(g(B)- \mathbb{E}[g(B)|C]\big) \Big] \Big] = 0.\]

Let’s unpack this.

If any conditional covariance remains on a region of \(C\) with non-negligible probability, an appropriate \(w(C)\) will detect it.

So conditional independence means:

No residual dependence remains after removing the effect of \(C\).

RKHS view: from functions to operators

We cannot test all square-integrable functions directly, so we restrict attention to functions in a Reproducing Kernel Hilbert Space (RKHS).

An RKHS \(\mathcal{H}_A\) consists of functions that are linear in \(\phi_A(a)\),

\[f(a) = \langle w, \phi_A(a) \rangle,\]

where \(\phi_A\) is a feature map.

Define the conditional mean embedding

\[\mu_{A|C}(c) = \mathbb{E}[ \phi_A(A) \mid C = c ].\]

It satisfies

\[\langle \mu_{A|C}(c), f \rangle_{\mathcal{H}_A} = \mathbb{E}[ f(A) \mid C = c ].\]

So conditional expectations become inner products in Hilbert space.
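
Here is a quick numeric sanity check of this identity, using an explicit two-dimensional feature map \(\phi_A(a) = (a, a^2)\) so everything stays in plain NumPy; the data-generating model and the test function \(f\) are illustrative choices.

```python
# Numeric check of <mu_{A|C}(c), f> = E[f(A) | C = c] with an explicit
# two-dimensional feature map phi_A(a) = (a, a^2). Model and f are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
C = rng.normal(size=n)
A = np.sin(C) + 0.3 * rng.normal(size=n)   # E[A|C=c] = sin(c), E[A^2|C=c] = sin(c)^2 + 0.09

phi_A = np.stack([A, A**2], axis=1)        # explicit feature map
w = np.array([2.0, -1.0])                  # f(a) = <w, phi_A(a)> = 2a - a^2

c0 = 0.8
near = np.abs(C - c0) < 0.02               # crude conditioning on C ≈ c0
mu_hat = phi_A[near].mean(axis=0)          # empirical estimate of mu_{A|C}(c0)

print("<mu_hat(c0), f>       :", w @ mu_hat)
print("analytic E[f(A)|C=c0] :", 2 * np.sin(c0) - (np.sin(c0) ** 2 + 0.09))
```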

We define the conditional cross-covariance operator

\[\mathcal{C}_{AB|C}(c) = \mathbb{E}_{AB|C}\Big[ \big(\phi_A(A)-\mu_{A|C}(c)\big) \otimes \big(\phi_B(B)-\mu_{B|C}(c)\big) \mid C=c \Big].\]

This operator satisfies

\[\langle f \otimes g, \mathcal{C}_{AB|C}(c) \rangle = \text{Cov}(f(A), g(B) \mid C=c).\]

So it encodes all conditional covariances.

The KCI operator

To aggregate over \(C\), we define

\[\mathcal{C}_{\text{KCI}} = \mathbb{E}_C\Big[ \mathcal{C}_{AB|C}(C) \otimes \phi_C(C) \Big].\]

For any test functions \(f, g, w\),

\[\langle f \otimes g, \mathcal{C}_{\text{KCI}}\, w \rangle = \mathbb{E}_C\Big[ w(C) \, \mathbb{E}_{AB|C} \Big[ (f(A)-\mathbb{E}[f(A)|C]) (g(B)-\mathbb{E}[g(B)|C]) \Big] \Big].\]

If the RKHSs \(\mathcal{H}_A, \mathcal{H}_B, \mathcal{H}_C\) are sufficiently rich (e.g., \(L^2\)-universal), then

\[\mathcal{C}_{\text{KCI}} = 0 \quad \Longleftrightarrow \quad A \perp\!\!\!\perp B \mid C.\]

A common test statistic is

\[\text{KCI} = \|\mathcal{C}_{\text{KCI}}\|_{\text{HS}}^2.\]

So the problem reduces to:

Estimate this operator from finite samples and determine whether it is zero.

That is where the real difficulty begins.

Why CI testing is fundamentally hard

The binary embedding trick

Now comes the key construction.

Start with any distribution of scalars \(A, B, C\) such that

\[A \not\!\perp\!\!\!\perp B \mid C.\]

So conditional dependence genuinely exists.

Now perform the following transformation.

  1. Write the binary expansions of \(A, B, C\).
  2. Truncate each to 100 bits: \(A_{100}, \quad B_{100}, \quad C_{100}.\)
  3. Construct a new conditioning variable by embedding the truncated bits of \(A\) into \(C\) via concatenation.

For example, if the truncated bits are \(C_{100} = 10011001\ldots\) and \(A_{100} = 10111100\ldots\), then the new \(C\) is:

\[C_\text{new} = ((C_{100} \text{ bits}) || (A_{100} \text{ bits})) = 10011001...10111100...\]

So the new conditioning variable contains both the coarse information originally in \(C\) and the first 100 bits of \(A\).

Finally, add an arbitrarily small continuous noise to all binary variables so the joint distribution remains absolutely continuous.
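
Below is a small NumPy sketch of this construction. It uses 20-bit truncation instead of 100 bits, since a float64 cannot hold 200 binary digits, but the mechanism is identical; the particular conditionally dependent triple \((A, B, C)\) is an illustrative choice.

```python
# Sketch of the binary-embedding construction with 20-bit truncation
# (float64 cannot hold 200 bits); the dependent triple (A, B, C) is illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, k = 10_000, 20

def trunc(x, k):
    """Keep the first k binary digits of x in [0, 1)."""
    return np.floor(x * 2.0**k) / 2.0**k

C = rng.uniform(size=n)
A = (C + rng.uniform(size=n)) % 1.0        # extra randomness beyond C
B = (A + 0.1 * rng.uniform(size=n)) % 1.0  # B tracks A, so A ⊥̸ B | C

A_t, B_t, C_t = trunc(A, k), trunc(B, k), trunc(C, k)

# Concatenate bits: the first k bits of C_new come from C, the next k from A.
C_new = C_t + A_t * 2.0**(-k)

eps = 2.0**(-2 * k - 5)                    # "arbitrarily small" continuous noise
A_new = A_t + eps * rng.uniform(size=n)
B_new = B_t + eps * rng.uniform(size=n)
C_new = C_new + eps * rng.uniform(size=n)

# Given C_new, A_new is (almost) fully determined: read A's bits off C_new's tail.
tail = (C_new * 2.0**k) % 1.0
A_read = np.round(tail * 2.0**k) / 2.0**k  # snap back to the 20-bit grid
print("max |A_new - A_read|:", np.abs(A_new - A_read).max())   # ~ eps ≈ 3e-14
```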

What just happened?

After this construction, \(C_\text{new}\) (almost) fully determines \(A_\text{new}\): the truncated bits of \(A\) can be read directly off the tail digits of \(C_\text{new}\), so conditioning on \(C_\text{new}\) leaves \(B_\text{new}\) with nothing extra to explain about \(A_\text{new}\).

As a result,

\[A_\text{new} \perp\!\!\!\perp B_\text{new} \mid C_\text{new}.\]

The conditional dependence has disappeared.

The crucial insight:

Nothing dramatic happened to the distribution at a coarse scale.

The only change was that extremely fine-grained information about \(A\) was embedded in the tail digits of \(C\).

To detect this transformation, a test would need effectively infinite precision — it would have to examine arbitrarily fine features of the joint distribution.

No finite-sample test can reliably do this.

That is the essence of the impossibility result:

Evidence for conditional independence can be hidden in arbitrarily subtle features of the distribution.

And no finite dataset can rule out such constructions.

The Shah–Peters impossibility theorem

This construction is not just a clever trick. It reflects a deep structural limitation formalized by Shah and Peters (2020).

For any finite-sample conditional independence test, and for any alternative distribution where \(A \not\!\perp\!\!\!\perp B \mid C\), there exists a null distribution (where \(A \perp\!\!\!\perp B \mid C\)) that the test cannot reliably distinguish from that alternative.

More concretely: any test whose Type I error is controlled at level \(\alpha\) uniformly over all absolutely continuous null distributions can have power at most \(\alpha\) against any fixed alternative, i.e., it does no better than random guessing.

In other words, no CI test can be uniformly valid over all continuous distributions.

There is no procedure that simultaneously controls the Type I error over all continuous distributions satisfying \(H_0\) and achieves nontrivial power against even a single alternative.

This is not a shortcoming of current algorithms.

It is a fundamental limitation of the problem itself.

Why it still fails in practice

You might think: these are adversarial constructions, surely in practice we don’t encounter them.

Correct — we rarely face carefully engineered binary embedding tricks.

But the mechanism behind the impossibility result is not artificial.

The core issue is this:

Conditional dependence can live in structured, localized, or oscillatory features of the distribution.

And detecting those features from finite samples is fundamentally delicate.

Let’s see how this shows up in a realistic setting.

A realistic example

Consider

\[A = f_A(C) + r_A\] \[B = f_B(C) + r_B\]

where

\[(r_A, r_B) \mid C \sim \mathcal{N} \left( 0, \begin{pmatrix} 1 & \gamma(C) \\ \gamma(C) & 1 \end{pmatrix} \right).\]

Here, \(f_A\) and \(f_B\) are the conditional means of \(A\) and \(B\) given \(C\), and \(\gamma(C)\) is the conditional correlation of the residuals \(r_A, r_B\) given \(C\).

Now define \(\gamma(C) = 0\) under \(H_0\) and \(\gamma(C) = \sin(C)\) under \(H_1\), with \(C \sim \mathcal{N}(0,1)\).

So under the alternative, the residual correlation oscillates smoothly as a function of \(C\).
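
Here is a sketch of a generator for this setup. The residual structure follows the bivariate normal model above; the specific conditional means \(f_A, f_B\) are illustrative choices.

```python
# Generator for A = f_A(C) + r_A, B = f_B(C) + r_B with residual correlation
# gamma(C). The particular f_A, f_B are illustrative; the residuals follow the
# bivariate normal model above.
import numpy as np

def sample(n, gamma_fn, rng):
    C = rng.normal(size=n)
    gamma = gamma_fn(C)
    r_A = rng.normal(size=n)
    # Correlated residuals: r_B = gamma*r_A + sqrt(1 - gamma^2)*z has
    # Var(r_B | C) = 1 and Cov(r_A, r_B | C) = gamma(C).
    r_B = gamma * r_A + np.sqrt(1.0 - gamma**2) * rng.normal(size=n)
    A = np.cos(C) + r_A          # f_A(C) = cos(C)   (illustrative)
    B = C**2 / 2.0 + r_B         # f_B(C) = C^2 / 2  (illustrative)
    return A, B, C

rng = np.random.default_rng(0)
A0, B0, C0 = sample(50_000, lambda c: np.zeros_like(c), rng)   # H0: gamma = 0
A1, B1, C1 = sample(50_000, np.sin, rng)                       # H1: gamma(C) = sin(C)
```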

Why is this subtle?

Notice something important.

The marginal residual correlation is

\[\mathbb{E}[\gamma(C)] = \mathbb{E}[\sin(C)].\]

Since the distribution of \(C \sim \mathcal{N}(0,1)\) is symmetric about zero and \(\sin(\cdot)\) is an odd function,

\[\mathbb{E}[\sin(C)] = 0.\]

So globally, the residual correlation averages out.

Marginally, there is no detectable correlation.

The dependence only appears locally in regions of \(C\).

(Figure: marginal and conditional covariance of \(A\) and \(B\) given \(C\).)
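
This cancellation is easy to verify numerically. Working directly with the residuals (a minimal sketch, assuming \(\gamma(C) = \sin(C)\) as above): the global correlation is near zero, while the correlation restricted to a neighborhood of \(C \approx \pi/2\) is close to \(\sin(\pi/2) = 1\).

```python
# Global vs. local residual correlation for gamma(C) = sin(C), C ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
C = rng.normal(size=n)
gamma = np.sin(C)
r_A = rng.normal(size=n)
r_B = gamma * r_A + np.sqrt(1.0 - gamma**2) * rng.normal(size=n)

# Globally the correlation averages out: E[sin(C)] = 0.
print("global corr(r_A, r_B):", np.corrcoef(r_A, r_B)[0, 1])

# Locally, around C = pi/2, the correlation is close to sin(pi/2) = 1.
mask = np.abs(C - np.pi / 2) < 0.1
print("local corr(r_A, r_B | C ≈ pi/2):", np.corrcoef(r_A[mask], r_B[mask])[0, 1])
```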

This is the key difficulty:

Conditional dependence may oscillate, cancel globally, and only be visible at the right scale.

This tension already makes detecting subtle dependence fragile.

But CI testing is hard for a deeper reason.

It is not only difficult to detect true conditional dependence — it is also dangerously easy to detect dependence that is not actually there.

If we look at the wrong structure (for example, use the wrong kernel scale), we may smooth away real dependence and fail to detect it.

If the conditional means are estimated inaccurately, we may introduce artificial residual correlation and falsely conclude that dependence exists.

In short:

The same procedure can both overlook real dependence and hallucinate spurious dependence.

That dual instability is what makes conditional independence testing fundamentally delicate.

When conditional means are perfect

To understand where things go wrong in practice, let’s first look at the idealized setting — where we know the conditional means exactly.

When linear kernels are used for \(A\) and \(B\), i.e., \(\phi_A(A)=A\) and \(\phi_B(B)=B\), the Kernel-based Conditional Independence (KCI) test can be understood in three conceptual steps.

  1. Get the perfect conditional means: \(\mu_{A|C}(c) = \mathbb{E}[A \mid C=c], \qquad \mu_{B|C}(c) = \mathbb{E}[B \mid C=c].\)
  2. Form residuals by removing the effect of \(C\): \(R_A = A - \mu_{A|C}(C), \qquad R_B = B - \mu_{B|C}(C).\)
  3. Measure whether the residuals are still dependent, using a kernel on \(C\) to localize the comparison across different regions of the conditioning variable.

Intuitively, KCI asks: after removing everything that can be explained by \(C\), is there any remaining dependence between \(A\) and \(B\)?
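
Here is a compact sketch of these three steps in the idealized setting, written as a simple V-statistic with a Gaussian kernel on \(C\) and linear kernels on \(A\) and \(B\). The data-generating model, bandwidth, and estimator details are illustrative simplifications, not the exact estimator from the paper.

```python
# Oracle version of the three steps above: true conditional means, residuals,
# and a kernel on C to localize the comparison. V-statistic form; the model,
# bandwidth, and estimator details here are illustrative.
import numpy as np

def rbf(C, bw):
    return np.exp(-(C[:, None] - C[None, :]) ** 2 / (2.0 * bw**2))

def oracle_stat(A, B, C, mu_A, mu_B, bw=0.5):
    R_A, R_B = A - mu_A, B - mu_B          # step 2: residuals from the true means
    u = R_A * R_B                          # pointwise residual products
    K = rbf(C, bw)                         # step 3: localize over C
    return u @ K @ u / len(C) ** 2         # estimate of ||C_KCI||_HS^2 (linear kernels on A, B)

rng = np.random.default_rng(0)
n = 2_000
C = rng.normal(size=n)
r_A = rng.normal(size=n)
gamma = np.sin(C)
r_B_h0 = rng.normal(size=n)
r_B_h1 = gamma * r_A + np.sqrt(1 - gamma**2) * rng.normal(size=n)

for label, r_B in [("H0", r_B_h0), ("H1", r_B_h1)]:
    A = np.cos(C) + r_A                    # f_A(C) = cos(C), known exactly here
    B = C**2 / 2 + r_B                     # f_B(C) = C^2/2,  known exactly here
    print(label, oracle_stat(A, B, C, mu_A=np.cos(C), mu_B=C**2 / 2))
```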

In this idealized infinite-sample regime, where the conditional means are known exactly, the residuals carry no trace of \(C\): under \(H_0\) their conditional covariance vanishes everywhere and the population statistic is exactly zero, while under \(H_1\) the statistic is strictly positive.

In other words, if the conditional means were known exactly, CI testing would be a well-behaved problem. The only difficulty would be statistical power: do we have enough data and the right resolution to detect the existing dependence?

The real trouble begins when we have to estimate those conditional means from data.

When we have to estimate the conditional means

Everything above assumed we knew the true conditional means

In practice, we do not.

We estimate them from data, typically using kernel ridge regression.

Write the estimators as

\[\hat{\mu}_{A|C}(c) = \mu_{A|C}(c) + \delta_{A|C}(c),\] \[\hat{\mu}_{B|C}(c) = \mu_{B|C}(c) + \delta_{B|C}(c),\]

where \(\delta_{A\mid C}\) and \(\delta_{B\mid C}\) are the regression errors.

What happens to the residuals?

The empirical residuals are now

\[\hat{R}_A = A - \hat{\mu}_{A|C}(C), \qquad \hat{R}_B = B - \hat{\mu}_{B|C}(C).\]

Substituting,

\[\hat{R}_A = (A - \mu_{A|C}(C)) - \delta_{A|C}(C),\] \[\hat{R}_B = (B - \mu_{B|C}(C)) - \delta_{B|C}(C).\]

Under our generative model,

\[A - \mu_{A|C}(C) = r_A, \qquad B - \mu_{B|C}(C) = r_B.\]

So

\[\hat{R}_A = r_A - \delta_{A|C}(C), \qquad \hat{R}_B = r_B - \delta_{B|C}(C).\]

Now multiply:

\[\hat{R}_A \hat{R}_B = r_A r_B - r_A \delta_{B|C}(C) - r_B \delta_{A|C}(C) + \delta_{A|C}(C)\delta_{B|C}(C).\]

Taking the conditional expectation given \(C\) (and assuming the regressions are trained on an independent sample, so that \(\delta_{A\mid C}\) and \(\delta_{B\mid C}\) are fixed functions with respect to the test data),

\[\mathbb{E}[\hat{R}_A \hat{R}_B \mid C] = \mathbb{E}[r_A r_B \mid C] + \delta_{A|C}(C)\delta_{B|C}(C).\]

But

\[\mathbb{E}[r_A r_B \mid C] = \gamma(C).\]

So we obtain

\[\mathbb{E}[\hat{R}_A \hat{R}_B \mid C] = \gamma(C) + \delta_{A|C}(C)\delta_{B|C}(C).\]

Under \(H_0\), \(\gamma(C) = 0.\) So the population residual covariance becomes

\[\mathbb{E}[\hat{R}_A \hat{R}_B \mid C] = \delta_{A\mid C}(C)\delta_{B\mid C}(C).\]

Even though \(A \perp\!\!\!\perp B \mid C\) holds in truth.

When KCI aggregates over \(C\) using the kernel (with \(C'\) denoting an independent copy of \(C\)),

\[\text{KCI} = \mathbb{E} \left[ k_C(C,C') \, \delta_{A\mid C}(C) \delta_{A\mid C}(C') \delta_{B\mid C}(C) \delta_{B\mid C}(C') \right].\]

So the regression errors induce a nonzero population statistic.

This is not sampling noise.

It is structural bias introduced by imperfect regression.
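
The following sketch makes the bias visible. The conditional means are fitted by a plain kernel ridge regression on a deliberately small, independent training set; under \(H_0\), the resulting statistic hovers around a positive constant as the test sample grows, while the oracle version built from the true means keeps shrinking. All modelling choices (data model, kernels, bandwidths, regularization) are illustrative.

```python
# Under H0, residuals built from *estimated* conditional means leave a positive
# population statistic (the delta_A * delta_B bias), unlike the oracle version.
# Kernel ridge regression is written out in NumPy; all choices are illustrative.
import numpy as np

def rbf(X, Y, bw):
    return np.exp(-(X[:, None] - Y[None, :]) ** 2 / (2.0 * bw**2))

def fit_krr(C_tr, Y_tr, bw=1.0, lam=1e-2):
    """Kernel ridge regression of Y on C; returns a predictor for E[Y | C]."""
    K = rbf(C_tr, C_tr, bw)
    alpha = np.linalg.solve(K + lam * len(C_tr) * np.eye(len(C_tr)), Y_tr)
    return lambda C: rbf(C, C_tr, bw) @ alpha

def stat(R_A, R_B, C, bw=0.5):
    u = R_A * R_B
    return u @ rbf(C, C, bw) @ u / len(C) ** 2

rng = np.random.default_rng(0)

def draw_h0(n):                            # A ⊥⊥ B | C by construction
    C = rng.normal(size=n)
    return np.cos(C) + rng.normal(size=n), C**2 / 2 + rng.normal(size=n), C

A_tr, B_tr, C_tr = draw_h0(50)             # small training set -> sizable errors
mu_A_hat, mu_B_hat = fit_krr(C_tr, A_tr), fit_krr(C_tr, B_tr)

for n in [500, 1_000, 2_000]:
    A, B, C = draw_h0(n)
    est = stat(A - mu_A_hat(C), B - mu_B_hat(C), C)   # estimated means
    orc = stat(A - np.cos(C), B - C**2 / 2, C)        # true means
    print(f"n={n}: estimated-means stat = {est:.4f}, oracle stat = {orc:.4f}")
```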

Type I error inflation and type I/II tradeoff

Why type I error explodes

Under ideal conditions (perfect regression), the KCI statistic behaves like a degenerate U-statistic under the null.

Because the population statistic is exactly zero under \(H_0\), the first-order term of the U-statistic vanishes. This degeneracy makes the statistic concentrate around zero at rate \(1/n\).

Null approximations — whether chi-square mixtures, Gamma approximations, or wild bootstrap — rely critically on this structure.

They assume that, under \(H_0\), the statistic is centered at zero and fluctuates only at the degenerate \(1/n\) scale.

When conditional means are estimated imperfectly, this structure breaks down.

Under \(H_0\), the statistic no longer has zero population mean: it is shifted upward by the bias term derived above.

Once the statistic has nonzero mean, the U-statistic is no longer degenerate. Its leading term behaves like an empirical average of nonzero quantities.

Formally, instead of shrinking toward zero, the statistic concentrates around a positive constant:

\[\text{KCI}_n = \text{Bias} + O_p(1/\sqrt{n}).\]

Unless the regression error itself shrinks sufficiently fast, the bias remains.

| Case | Mean of KCI | Std. dev. |
| --- | --- | --- |
| Perfect regression | \(0\) | \(O(1/n)\) |
| Imperfect regression | \(O(1)\) | \(O(1/\sqrt{n})\) |
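
A rough Monte Carlo check of these scalings under \(H_0\), using one fixed small training set across replications; again, every modelling choice here is illustrative.

```python
# Monte Carlo check of the table above under H0: with perfect conditional means
# the statistic is centered near zero and shrinks like 1/n; with a fixed,
# imperfect regression it concentrates around a positive bias. Illustrative setup.
import numpy as np

def rbf(X, Y, bw):
    return np.exp(-(X[:, None] - Y[None, :]) ** 2 / (2.0 * bw**2))

def fit_krr(C_tr, Y_tr, bw=1.0, lam=1e-2):
    alpha = np.linalg.solve(rbf(C_tr, C_tr, bw) + lam * len(C_tr) * np.eye(len(C_tr)), Y_tr)
    return lambda C: rbf(C, C_tr, bw) @ alpha

def stat(R_A, R_B, C, bw=0.5):
    u = R_A * R_B
    return u @ rbf(C, C, bw) @ u / len(C) ** 2

rng = np.random.default_rng(1)

def draw_h0(n):
    C = rng.normal(size=n)
    return np.cos(C) + rng.normal(size=n), C**2 / 2 + rng.normal(size=n), C

A_tr, B_tr, C_tr = draw_h0(50)                       # fixed, imperfect regression
mu_A_hat, mu_B_hat = fit_krr(C_tr, A_tr), fit_krr(C_tr, B_tr)

for n in [200, 800]:
    perfect, imperfect = [], []
    for _ in range(200):                             # fresh test sets
        A, B, C = draw_h0(n)
        perfect.append(stat(A - np.cos(C), B - C**2 / 2, C))
        imperfect.append(stat(A - mu_A_hat(C), B - mu_B_hat(C), C))
    print(f"n={n:4d} | perfect: mean={np.mean(perfect):.4f}, std={np.std(perfect):.4f}"
          f" | imperfect: mean={np.mean(imperfect):.4f}, std={np.std(imperfect):.4f}")
```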

The consequence:

Null calibration procedures still assume a centered statistic.

But the true distribution is shifted.

As the test sample size \(n\) grows, the statistic concentrates ever more tightly around this positive bias, while the threshold implied by the centered null approximation keeps shrinking.

Eventually, the statistic will almost surely exceed any fixed null threshold.

The Type I error thus inflates as \(n\) increases.

This is why regression error is not a small nuisance.

It fundamentally changes the asymptotic regime.

Type I and type II error tradeoff

In principle, we choose the kernel (especially the bandwidth on \(C\)) to maximize power — that is, to better detect conditional dependence.

But here is the subtle danger:

The same kernel choice that amplifies true dependence can also amplify regression-induced bias.

Recall that under imperfect regression, the null statistic contains the term

\[\delta_{A\mid C}(C)\,\delta_{B\mid C}(C).\]

These regression errors are not arbitrary noise. They are smooth, structured functions of \(C\).

When we optimize the kernel bandwidth to increase sensitivity to dependence, we are effectively choosing a weighting function over \(C\).

If that weighting aligns with regions where \(\delta_{A\mid C}(C)\,\delta_{B\mid C}(C)\) is large, the test statistic increases — even under the null.

In other words: tuning the kernel to be more sensitive to dependence also tunes it toward the structure of the regression error.

(Figure: Type I and Type II error tradeoff when selecting the bandwidth of the kernel on \(C\). Smaller training sizes correspond to worse conditional mean estimates.)

This creates a fundamental tradeoff: the kernel choices that make the test more sensitive to genuine dependence also make it more sensitive to regression-induced bias.

Optimizing for power can therefore push us directly into spurious rejection.

Conditional independence testing is fragile not only because dependence may be subtle,
but because the very act of searching for it can create it.

The central lesson

CI testing does not merely require detecting dependence.

It requires:

Estimating conditional means accurately enough that regression error does not masquerade as conditional dependence.

That is an extremely strong requirement.

And that is why CI testing fails in practice.

Final Summary

Takeaways: why CI is hard both theoretically and practically

Conditional independence testing is difficult for structural, not accidental, reasons:

  1. Dependence can hide in subtle structure.
    Conditional dependence may be localized, oscillatory, or globally canceling — detectable only at the right scale.

  2. The kernel on \(C\) controls what structure is visible.
    Over-smoothing misses real dependence (Type II error).
    Over-localizing amplifies noise and instability.

  3. Regression error induces spurious dependence.
    Imperfect estimation of \(\mathbb{E}[\phi_A(A) \mid C]\) and \(\mathbb{E}[\phi_B(B) \mid C]\) introduces artificial residual correlation, even under the null.

  4. Kernel selection can overfit regression bias.
    Optimizing the kernel on \(C\) for power may align the test with structured regression error rather than true signal.

  5. Null approximations rely on ideal asymptotics.
    When regression error prevents degeneracy, the statistic is no longer centered at zero, and Type I error can inflate dramatically.

In short:

CI testing is hard not just because dependence is difficult to detect,
but because false dependence is easy to create.

Practical recommendations:

Though the theory is pessimistic, CI testing can still be useful in practice — if we treat it as a fragile procedure and design the pipeline accordingly.

Paper: On the Hardness of Conditional Independence Testing In Practice

Code: github.com/he-zh/kci-hardness