Kernel methods are one of the most elegant bridges between linear algebra, functional analysis, and modern machine learning. Once you see the pattern—map into a Hilbert space, do linear operations there—everything from SVMs to distribution comparison becomes a special case.


1. Preliminaries: The Geometry of Functions

To understand kernels, we first need to define the spaces in which they live. We transition from finite-dimensional vectors to infinite-dimensional function spaces.

1.1 Function Spaces

A function space \(\mathcal{F}\) is a vector space where the “points” are functions.

Definition. Let \(f, f' \in \mathcal{F}\) correspond to weight vectors \(w, w'\). Addition and scaling are defined point-wise, while the inner product and norm are inherited from the weights:

  • Addition/Scaling: \((\alpha f + f')(x) = \alpha f(x) + f'(x)\).
  • Inner Product: \(\langle f, f' \rangle = w \cdot w'\).
  • Norm: \(\|f\|_{\mathcal{F}} = \sqrt{\langle f, f \rangle} = \|w\|\).

1.2 The \(L^2(X, \mu)\) Space

In probability and measure theory, we often care about functions whose “size” is finite under an integral.

Definition. \(L^2(X, \mu)\) consists of all measurable functions \(f\) such that:

\[\|f\|_2 = \left( \int_X |f(x)|^2 d\mu(x) \right)^{1/2} < \infty\]

1.3 Hilbert Space

A Hilbert space is a vector space equipped with an inner product that is complete with respect to the induced norm (meaning all Cauchy sequences converge within the space).

Cauchy Sequence: A sequence \((f_n)\) is Cauchy if for every \(\epsilon > 0\), there exists \(N\) such that \(\|f_n - f_m\|_{\mathcal{F}} < \epsilon\) for all \(m, n \ge N\).

Completeness: Ensures we can perform calculus and optimization without “falling out” of the space.


2. Kernels and the RKHS

The “Kernel Trick” allows us to compute inner products in high-dimensional feature spaces without ever explicitly computing the coordinates of the data.

2.1 What is a Kernel?

Definition. A function \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is a positive definite kernel if there exists a Hilbert space \(\mathcal{H}\) and a feature map \(\phi: \mathcal{X} \to \mathcal{H}\) such that:

\[k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}\]

The Mercer Condition: For \(k\) to be a valid kernel, the kernel matrix \(K\) (where \(K_{ij} = k(x_i, x_j)\)) must be symmetric positive semi-definite (SPSD) for every finite set of points \(x_1, \dots, x_m \in \mathcal{X}\).

This means \(\forall \alpha \in \mathbb{R}^m,\ \alpha^\top K \alpha \ge 0\); equivalently, all eigenvalues of \(K\) are non-negative.

In machine learning, commonly used kernels include the Gaussian and Laplace kernels, i.e.,

\[k(x,x')=\exp\left(-\frac{\| x-x'\|^2_2}{2\sigma^2}\right), \qquad k(x,x')=\exp\left(-\frac{\| x-x'\|_1}{\sigma}\right)\]

where \(\sigma > 0\) is a bandwidth parameter. These kernels belong to a class of kernel functions called radial basis functions (RBF).
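
As a quick sanity check of the Mercer condition, here is a small NumPy sketch (the helper functions and toy data are illustrative, not from any particular library) that builds Gaussian and Laplace kernel matrices and confirms their eigenvalues are non-negative:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # k(x, x') = exp(-||x - x'||_2^2 / (2 sigma^2))
    sq_dists = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def laplace_kernel(X, Y, sigma=1.0):
    # k(x, x') = exp(-||x - x'||_1 / sigma)
    l1_dists = np.sum(np.abs(X[:, None, :] - Y[None, :, :]), axis=-1)
    return np.exp(-l1_dists / sigma)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))           # 50 toy points in R^3

for kernel in (gaussian_kernel, laplace_kernel):
    K = kernel(X, X, sigma=1.5)
    eigvals = np.linalg.eigvalsh(K)    # K is symmetric, so eigvalsh applies
    print(kernel.__name__, "min eigenvalue:", eigvals.min())   # >= 0 up to round-off
```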

Universal and characteristic kernel (reference)

The notions of universal and characteristic kernels have developed in parallel in machine learning: universal kernels were proposed in the context of achieving the Bayes risk with kernel-based classification/regression algorithms, while characteristic kernels were introduced in the context of distinguishing probability measures by embedding them into a reproducing kernel Hilbert space (RKHS).

2.2 Reproducing Kernel Hilbert Space (RKHS)

The RKHS is a special Hilbert space where evaluation is a bounded linear functional.

Evaluation Functional:

  • For each \(x\in X\), the evaluation functional \(L_x\) is the map that takes a function \(f\in\mathcal{H}\) and returns its value at \(x\): \(L_x(f)=f(x)\).
  • This functional \(L_x\) is linear, meaning \(L_x(f+g)=L_x(f)+L_x(g)\) and \(L_x(\alpha f)=\alpha L_x(f)\).

Definition (Reproducing kernel Hilbert space, RKHS). A Hilbert space \(\mathcal{H}\) of functions is a reproducing kernel Hilbert space (RKHS) if the evaluation functionals \(L_x[f] = f(x)\) are bounded, i.e., for all \(x \in X\) there exists some \(C_x > 0\) such that

\[|L_x[f]|=|f(x)|\le C_x\|f\|_{\mathcal{H}}, \qquad \forall f \in \mathcal{H}\]

Interpretation: In many function spaces, two functions can be “close” in norm but have very different values at a specific point \(x\). In an RKHS, if functions are close in norm, their values \(f(x)\) are also close.

The Reproducing Property: For every \(x \in X\), there exists a function \(k_x \in \mathcal{H}\) (the representer of evaluation) such that:

\[f(x) = \langle f, k_x \rangle_{\mathcal{H}}\]

If we set \(f = k_y\) (or written as \(k(\cdot, y)\)), we get the kernel itself:

\[k(x, y) = \langle k_x, k_y \rangle_{\mathcal{H}}\]

The RKHS \(\mathcal{H}\) is fully characterized by the reproducing kernel \(k\).

Theorem. For every positive definite function \(k(·, ·)\) on \(X × X\) there exists a unique (up to isometric isomorphism) RKHS with \(k\) as its reproducing kernel. Conversely, the reproducing kernel of an RKHS is unique and positive definite.

The kernel \(k\) generates the RKHS, i.e. \(\mathcal{H}=\overline{\operatorname{span}}\{k(x,\cdot)\mid x\in X\}\) (the closure of the span).

Define \(\phi(x) := k(x,\cdot)=\big(x' \mapsto k(x,x')\big)\) for all \(x \in \mathcal{X}\). Then \(\mathcal{H}\) is the completion of the set of all finite linear combinations \(\sum_{i=1}^m \alpha_i \phi(x_i)\) with \(x_1, \dots, x_m \in \mathcal{X}\), \(\alpha_1, \dots, \alpha_m \in \mathbb{R}\).

Also, we have

\[\left\langle \sum_{i=1}^m \alpha_i \phi(x_i) , \sum_{j=1}^m \beta_j \phi(x'_j)\right\rangle_\mathcal{H} = \sum_{i=1}^m \sum_{j=1}^m \alpha_i \beta_j\, k(x_i, x_j')\]
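
To make this concrete, the following sketch (same illustrative Gaussian kernel as above, toy data) represents \(f=\sum_i \alpha_i \phi(x_i)\) and \(g=\sum_j \beta_j \phi(x'_j)\) by their coefficients, computes \(\langle f, g\rangle_{\mathcal{H}} = \alpha^\top K \beta\), and evaluates \(f\) at a point via \(f(x)=\sum_i \alpha_i k(x_i, x)\):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), illustrative helper
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(1)
Xf = rng.normal(size=(5, 2))       # centers x_i of f
Xg = rng.normal(size=(4, 2))       # centers x'_j of g
alpha = rng.normal(size=5)
beta = rng.normal(size=4)

# <f, g>_H = sum_ij alpha_i beta_j k(x_i, x'_j) = alpha^T K beta
inner_fg = alpha @ gaussian_kernel(Xf, Xg) @ beta

# Reproducing property: f(x) = <f, k(x, .)>_H = sum_i alpha_i k(x_i, x)
x = rng.normal(size=(1, 2))
f_at_x = alpha @ gaussian_kernel(Xf, x)[:, 0]
print(inner_fg, f_at_x)
```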

By virtue of the reproducing property, we know:

  • \(k(x, \cdot)\) is a (possibly infinite-dimensional) representer of the data point \(x\);
  • it also acts as the representer of evaluation: taking the inner product with \(k(x,\cdot)\) evaluates any function in \(\mathcal{H}\) at \(x\).

Definition (Separable Hilbert Space). A Hilbert space \(\mathcal{H}\) is said to be separable if it has a countable orthonormal basis.

Let \(a\in \mathcal{H}\) and let \((e_i)_{i\in I}\) be an orthonormal basis for \(\mathcal{H}\), where the index set \(I\) is either finite or countably infinite. Then

\[a = \sum_{i\in I} \langle a, e_i\rangle_{\mathcal{H}}\, e_i .\]

Most Hilbert spaces we deal with are separable. An example of a kernel whose RKHS is non-separable (on an uncountable domain) is \(k(x,y)=\mathbb{1}(x=y)\).
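
A finite-dimensional analogue of the basis expansion above, using a random orthonormal basis obtained from a QR factorization (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # columns of Q form an orthonormal basis (e_i)
a = rng.normal(size=d)

# a = sum_i <a, e_i> e_i
coeffs = Q.T @ a                               # coefficients <a, e_i>
reconstruction = Q @ coeffs
print(np.allclose(a, reconstruction))          # True
```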

2.2.1 Tensor Product (reference)

Definition (Tensor product of vectors).
If \(x, y\) are vectors of length \(M\) and \(N\), respectively, their tensor product \(x \otimes y\) is defined as the \(M \times N\) matrix given by \((x \otimes y)_{ij} = x_i y_j\). In other words, \(x \otimes y = xy^\top\).

Definition (Tensor product of matrices). If \(S : \mathbb{R}^M \to \mathbb{R}^M\) and \(T : \mathbb{R}^N \to \mathbb{R}^N\) are matrices, we define the linear mapping \(S \otimes T : L_{M,N}(\mathbb{R}) \to L_{M,N}(\mathbb{R})\) (where \(L_{M,N}(\mathbb{R})\) denotes the space of \(M \times N\) matrices) by linear extension of \((S \otimes T)(e_i \otimes e_j )=(Se_i) \otimes (Te_j )\). The linear mapping \(S \otimes T\) is called the tensor product of the matrices \(S\) and \(T\).

Tensor product

  • \(\langle f, k_x \rangle\, \langle g, l_y\rangle = \langle f, (k_x\otimes l_y)\, g \rangle\)
  • \((a\otimes b)\, x = \langle x, b\rangle\, a\); in matrix form \(a\otimes b = ab^\top\), so \((ab^\top)f=(b^\top f)\,a\)
  • \(\langle g\otimes f , \psi(y) \otimes \phi(x) \rangle_{\mathcal{G} \otimes \mathcal{H}} = \langle g, \psi(y)\rangle_{\mathcal{G}}\, \langle f, \phi(x)\rangle_{\mathcal{H}}\)
  • \((S\otimes T)(x\otimes y) = (Sx)\otimes(Ty)\)
  • \((S_1 \otimes T_1)(S_2 \otimes T_2) = (S_1 S_2) \otimes (T_1 T_2)\)
  • \((S ⊗ T)X = SXT^T\) if S, T are matrices
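
These identities can be checked numerically in the matrix case, where \(x \otimes y = xy^\top\) and \((S \otimes T)X = SXT^\top\); a small sketch with random matrices (nothing below is kernel-specific):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 4, 3
x, a = rng.normal(size=(2, M))
y, b, v = rng.normal(size=(3, N))
S1, S2 = rng.normal(size=(2, M, M))
T1, T2 = rng.normal(size=(2, N, N))

# (a ⊗ b) v = <v, b> a, with a ⊗ b represented by the matrix a b^T
print(np.allclose(np.outer(a, b) @ v, np.dot(v, b) * a))

# (S ⊗ T)(x ⊗ y) = (S x) ⊗ (T y), using (S ⊗ T) X = S X T^T
X = np.outer(x, y)
print(np.allclose(S1 @ X @ T1.T, np.outer(S1 @ x, T1 @ y)))

# (S1 ⊗ T1)(S2 ⊗ T2) = (S1 S2) ⊗ (T1 T2), checked by acting on X
lhs = S1 @ (S2 @ X @ T2.T) @ T1.T
rhs = (S1 @ S2) @ X @ (T1 @ T2).T
print(np.allclose(lhs, rhs))
```
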
2.2.2 Hilbert-Schmidt Operator (reference)

Hilbert-Schmidt operator. Let \(\mathcal{F}\) and \(\mathcal{G}\) be separable Hilbert spaces and let \((f_i)_{i\in I}\) and \((g_j)_{j\in J}\) be orthonormal bases for \(\mathcal{F}\) and \(\mathcal{G}\), respectively. A Hilbert-Schmidt operator is a bounded operator \(\mathcal{A}: \mathcal{G}\to\mathcal{F}\) whose Hilbert-Schmidt norm

\[\|\mathcal{A}\|_{HS}^2 = \sum_{j\in J} \|\mathcal{A}g_j\|_{\mathcal{F}}^2=\sum_{i\in I}\sum_{j\in J} |\langle \mathcal{A} g_j, f_i\rangle_\mathcal{F}|^2\]

is finite (\(\|\mathcal{A}\|_{HS}^2 < \infty\)).

The Hilbert-Schmidt operators mapping from \(\mathcal{G}\) to \(\mathcal{F}\) form a Hilbert space, written \(HS(\mathcal{G},\mathcal{F})\), with inner product

\[\langle L, M\rangle_{HS} = \sum_{j\in J} \langle L g_j, M g_j\rangle_\mathcal{F}=\sum_{i\in I}\sum_{j \in J} \langle L g_j, f_i\rangle_\mathcal{F}\,\langle M g_j, f_i\rangle_\mathcal{F}\]

Given \(b ∈ \mathcal{G}\) and \(a ∈ \mathcal{F}\), we define the tensor product \(a⊗b\) as a rank-one operator from \(\mathcal{G}\) to \(\mathcal{F}\),

\[(a\otimes b)\,g = \langle g,b \rangle_\mathcal{G}\, a\]

The Hilbert-Schmidt norm is

\[\begin{align*} \|a \otimes b\|_{HS}^2 &= \sum_{j\in J} \|(a \otimes b) g_j \|_{\mathcal{F}}^2\\ &= \sum_{j\in J} \|\langle g_j ,b \rangle_\mathcal{G} a \|_{\mathcal{F}}^2\\ &= \sum_{j\in J} |\langle g_j ,b \rangle_\mathcal{G}|^2 \| a \|_\mathcal{F}^2\\ &= \|b\|_\mathcal{G}^2 \|a\|_{\mathcal{F}}^2 \end{align*}\]
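
In finite dimensions, the rank-one operator \(a\otimes b\) is the matrix \(ab^\top\) and the Hilbert-Schmidt norm coincides with the Frobenius norm, so \(\|a\otimes b\|_{HS} = \|a\|_{\mathcal{F}}\,\|b\|_{\mathcal{G}}\) can be verified directly (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
a = rng.normal(size=5)    # a in F = R^5
b = rng.normal(size=7)    # b in G = R^7

hs_norm = np.linalg.norm(np.outer(a, b), ord="fro")   # ||a ⊗ b||_HS
print(np.allclose(hs_norm, np.linalg.norm(a) * np.linalg.norm(b)))   # True
```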

Definition (bounded operator).
A linear operator \(A : \mathcal{F} \to \mathcal{G}\) is bounded when there exists \(\lambda_A \ge 0\) such that

\[\|Af\|_{\mathcal{G}} \le \lambda_A \|f\|_\mathcal{F} \quad \forall f \in \mathcal{F}\]

Property.

  • \(\langle L, a\otimes b \rangle_{HS}=\langle a, Lb \rangle_\mathcal{F}\), given \(a\otimes b, L ∈ HS(\mathcal{G}, \mathcal{F})\),
  • \(\langle u\otimes v, a\otimes b \rangle_{HS}=\langle u, a \rangle_\mathcal{F} \langle b, v \rangle_\mathcal{G}\),
  • \(\|\phi(x) \otimes \psi(y)\|_{HS} = \|\phi(x)\|_{\mathcal{F}} \|\psi(y)\|_{\mathcal{G}}=\sqrt{k(x,x) l(y,y)}\),
  • \(\langle C_{XY}, f\otimes g\rangle_{HS} = E_{xy}\langle \phi(x)\otimes \psi(y), f\otimes g\rangle_{HS} = E_{xy} [\langle f, \phi(x)\rangle_{\mathcal{F}}\, \langle g, \psi(y)\rangle_{\mathcal{G}}]=E_{xy}[f(x)g(y)]\),
  • \(\langle C_{XY}, C_{XY}\rangle_{HS} = E_{x,y}E_{x',y'}k(x,x')l(y,y')\),
  • \(\langle \mu_{X} \otimes \mu_{Y}, \mu_{X} \otimes \mu_{Y} \rangle_{HS} = E_{x,x'}E_{y,y'}k(x,x')l(y,y')\),
  • In finite dimensions (the matrix case), \(\langle C_{XY}, A\rangle_{HS} = \operatorname{trace}(C_{XY}^\top A)=\sum_j (C_{XY}g_j)^\top (Ag_j)\).
  • \(\|a \otimes b\|_{HS} = \|a\|_{\mathcal{F}} \|b\|_{\mathcal{G}}\).

3. Kernel Mean Embeddings

Kernel mean embeddings allow us to represent entire probability distributions as a single point (a function) in an RKHS.

Definition. The mean embedding of a distribution \(\mathbb{P}\) is:

\[\mu_{\mathbb{P}} := \mathbb{E}_{X \sim \mathbb{P}} [k(X, \cdot)] = \int_{\mathcal{X}} \phi(x) d\mathbb{P}(x)\]

Why is this useful?

  • If the kernel is “characteristic” (like the Gaussian or Laplace kernel), the mapping \(\mathbb{P} \to \mu_{\mathbb{P}}\) is injective. This means \(\mu_{\mathbb{P}}\) contains all information about the distribution: \(\|\mu_\mathbb{P}-\mu_\mathbb{Q}\|_{\mathcal{H}} = 0\) if and only if \(\mathbb{P}=\mathbb{Q}\).
  • If \(\mathbb{P}\) and \(\mathbb{Q}\) are close in common probability metrics, then \(\mu_\mathbb{P}\) is also close to \(\mu_\mathbb{Q}\) in the \(\|\cdot \|_{\mathcal{H}}\) norm.

Expectations as Inner Products: To calculate the expected value of a function \(f\), we simply take an inner product:

\[\mathbb{E}_{\mathbb{P}}[f(x)] = \langle f, \mu_{\mathbb{P}} \rangle_{\mathcal{H}}\]

Lemma (Existence of mean embedding). If \(\mathbb{E}_{X∼\mathbb{P}}[\sqrt{k(X, X)}] < \infty\), then \(\mu_\mathbb{P}\in\mathcal{H}\) and \(\mathbb{E}_{X∼\mathbb{P}}[f(X)]=⟨f,\mu_{\mathbb{P}}⟩_\mathcal{H}\).
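
Empirically, \(\mu_{\mathbb{P}}\) is estimated by \(\hat\mu_{\mathbb{P}} = \frac1n\sum_{i=1}^n k(x_i,\cdot)\), so \(\langle f, \hat\mu_{\mathbb{P}}\rangle_{\mathcal{H}} = \frac1n\sum_i f(x_i)\), and \(\|\hat\mu_{\mathbb{P}}-\hat\mu_{\mathbb{Q}}\|^2_{\mathcal{H}}\) gives a (biased) estimate of the squared MMD. A sketch on toy Gaussian samples (the kernel helper and the data are illustrative):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(5)
X = rng.normal(loc=0.0, size=(200, 2))     # samples from P
Y = rng.normal(loc=0.5, size=(200, 2))     # samples from Q

# f = sum_j alpha_j k(z_j, .), represented by centers z_j and coefficients alpha_j
Z = rng.normal(size=(3, 2))
alpha = np.array([1.0, -0.5, 2.0])

# <f, mu_P_hat>_H = (1/n) sum_i f(x_i): the empirical expectation of f under P
print("E_P[f] ~", alpha @ gaussian_kernel(Z, X).mean(axis=1))

# Biased estimate of the squared MMD: ||mu_P_hat - mu_Q_hat||_H^2
mmd2 = (gaussian_kernel(X, X).mean() + gaussian_kernel(Y, Y).mean()
        - 2 * gaussian_kernel(X, Y).mean())
print("MMD^2 ~", mmd2)
```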


4. Operators: Covariance and Conditional Mean

Just as we have covariance matrices for vectors, we have operators for functions.

4.1 Cross-Covariance Operator

Definition. The uncentered cross-covariance operator \(C_{YX}: \mathcal{H} \to \mathcal{G}\) is:

\[C_{YX} := \mathbb{E}_{YX} [\psi(Y) \otimes \phi(X)]\]

It satisfies the property:

\(\langle g, C_{YX} f \rangle_{\mathcal{G}} = \mathbb{E}_{YX}[g(Y)\, f(X)]\).

Centered cross-covariance operator is

\[\tilde C_{YX} :=\mathbb{E}_{YX}[\psi(Y)\otimes\phi(X)] - \mu_{P_Y} \otimes \mu_{P_X}=\mu_{P_{YX}} - \mu_{P_Y \otimes P_X}.\]

Equivalently, we can define the centered operator \(\tilde C_{YX}\) as the unique bounded operator that satisfies

\[\langle g, \tilde C_{YX}f\rangle_{\mathcal{G}} = \mathrm{Cov}[g(Y),f(X)]\]

for all \(f \in \mathcal{H}\) and \(g \in \mathcal{G}\).

Covariance operator. If \(X = Y\) , we call \(C_{XX}\) the covariance operator.

  • \((C_{YX}f)(\cdot)=\int_{X\times Y}l(\cdot,y)f(x) dP_{XY}(x,y)\).
  • \((C_{XX}f)(\cdot)=\int_{X}k(\cdot,x)f(x) dP_{X}(x)\).

Theorem. If \(\mathbb{E}_{YX}[g(Y)\mid X=·]∈ \mathcal{H}\) for \(g∈\mathcal{G}\), then

\[C_{XX}\mathbb{E}_{Y\mid X}[g(Y)\mid X =·]=C_{XY} g\]
Proof of the theorem.

\[\begin{align*} C_{XX}\,\mathbb{E}_{Y\mid X}[g(Y)\mid X =\cdot] &= \mathbb{E}_{X}\big[\phi(X)\,\big\langle \phi(X),\, \mathbb{E}_{Y\mid X}[g(Y)\mid X =\cdot]\big\rangle_{\mathcal{H}}\big]\\ &= \mathbb{E}_{X}\big[\phi(X)\, \mathbb{E}_{Y\mid X}[g(Y)\mid X]\big] \qquad \text{(reproducing property)}\\ &= \int_{X}\int_Y \phi(x)\,\langle g , \psi(y)\rangle_{\mathcal{G}}\, dP_{Y\mid X}(y\mid x)\, dP_X(x) \\ &= \int_{X\times Y} \phi(x)\,\langle g , \psi(y)\rangle_{\mathcal{G}}\, dP_{XY}(x,y) \\ &= \mathbb{E}_{XY}\big[(\phi(X)\otimes\psi(Y))\,g\big] = C_{XY}\, g. \end{align*}\]

Empirical estimate of the centered \(\tilde C_{YX}\):

\[\begin{align*} \hat C_{YX}&= \frac{1}{n} \sum_{i=1}^{n} l(y_i,\cdot)\otimes k(x_i,\cdot) - \hat \mu_{P_Y} \otimes \hat \mu_{P_X} \\ &=\frac{1}{n} \sum_{i=1}^{n} \{l(y_i,\cdot)-\hat \mu_{P_Y} \} \otimes \{k(x_i,\cdot)-\hat\mu_{P_X} \} \\ &= \frac1n (\Psi - \hat\mu_{P_Y} \mathbf{1}^\top)(\Phi - \hat\mu_{P_X} \mathbf{1}^\top)^\top \\ &= \frac1n (\Psi (I-\frac1n \mathbf{1} \mathbf{1}^\top))(\Phi (I-\frac1n \mathbf{1} \mathbf{1}^\top))^\top \\ &= \frac1n \Psi (I-\frac1n \mathbf{1} \mathbf{1}^\top) \Phi^\top \\ &=\frac1n \Psi H \Phi^\top \\ \end{align*}\]

where \(H = I_n - \frac1n \mathbb{1}_n\) is the centering matrix with \(\mathbb{1}_n\) an \(n \times n\) matrix of ones, \(\Psi = (\psi(y_1),\dots,\psi(y_n))\) and \(\Phi = (\phi(x_1),\dots,\phi(x_n))\), with \(\psi(y)=l(y,\cdot)\) and \(\phi(x)=k(x,\cdot)\).
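
As a numerical check of this estimator (toy data, illustrative Gaussian kernels): for \(f=\sum_i a_i k(x_i,\cdot)\) and \(g=\sum_i b_i l(y_i,\cdot)\), the quantity \(\langle g, \hat C_{YX} f\rangle_{\mathcal{G}} = \frac1n\, b^\top L H K\, a\) should equal the empirical covariance of \(g(Y)\) and \(f(X)\):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(6)
n = 100
X = rng.normal(size=(n, 2))
Y = X @ rng.normal(size=(2, 1)) + 0.1 * rng.normal(size=(n, 1))   # Y depends on X

K = gaussian_kernel(X, X)             # Gram matrix of X (kernel k)
L = gaussian_kernel(Y, Y)             # Gram matrix of Y (kernel l)
H = np.eye(n) - np.ones((n, n)) / n   # centering matrix

# f = sum_i a_i k(x_i, .),  g = sum_i b_i l(y_i, .)
a = rng.normal(size=n)
b = rng.normal(size=n)

# <g, C_hat_YX f>_G = (1/n) b^T L H K a
lhs = b @ L @ H @ K @ a / n

# Empirical covariance of g(Y) and f(X), with f(x_i) = (K a)_i and g(y_i) = (L b)_i
fX, gY = K @ a, L @ b
rhs = np.mean((gY - gY.mean()) * (fX - fX.mean()))
print(np.allclose(lhs, rhs))          # True
```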

4.2 Conditional Mean Embedding (CME)

The CME represents the conditional distribution \(P(Y\mid X=x)\) as an element in \(\mathcal{G}\).

Let \(\mathcal{X}\) and \(\mathcal{Y}\) be measurable spaces and let \(X\) and \(Y\) be random variables taking values in \(\mathcal{X}\) and \(\mathcal{Y}\), respectively. Assume \(k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) and \(l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}\) are positive definite kernels with corresponding RKHSs \(\mathcal{H}\) and \(\mathcal{G}\). Let \(U_{Y\mid X} : \mathcal{H} \to \mathcal{G}\) and \(U_{Y\mid X=x} \in \mathcal{G}\) be the conditional mean embeddings of the conditional distributions \(P(Y \mid X)\) and \(P(Y \mid X = x)\), so that

\[U_{Y\mid X=x} = \mathbb{E}_{Y\mid X}[\psi(Y)\mid X=x] = U_{Y\mid X}\,k(x,\cdot)\] \[\mathbb{E}_{Y\mid X}[g(Y)\mid X=x] = \langle g, U_{Y\mid X=x} \rangle_\mathcal{G} , \qquad \forall g \in \mathcal{G}\]

Definition. Let \(C_{XX} : \mathcal{H} \to \mathcal{H}\) and \(C_{YX} : \mathcal{H} \to \mathcal{G}\) be the covariance operator of \(X\) and the cross-covariance operator from \(X\) to \(Y\), respectively. Then, the conditional mean embeddings \(U_{Y\mid X}\) and \(U_{Y\mid X=x}\) are defined as

\[U_{Y\mid X}:= C_{YX} C^{-1}_{XX}\] \[U_{Y\mid X=x}:=C_{YX}C^{-1}_{XX}\,k(x,\cdot)\]

Property.

  • \(\mathbb{E}_{Y\mid X}[g(Y)\mid X=x]=\langle g, U_{Y\mid X=x}\rangle_{\mathcal{G}}=\langle \mathbb{E}_{Y\mid X}[g(Y)\mid X=\cdot], k(x,\cdot)\rangle_{\mathcal{H}}\).

Empirical estimate:

\[\begin{align*} \hat U_{Y\mid X} &= \hat C_{YX} (\hat C_{XX} + \lambda I)^{-1}\\ &= \Psi (K+\lambda I)^{-1} \Phi^\top\\ \end{align*}\]

where \(\Psi:= (\psi(y_1), \dots, \psi(y_n))\), \(\Phi:= (\phi(x_1), \dots, \phi(x_n))\), \(K=\Phi^\top \Phi\) is the Gram matrix for samples from the variable \(X\), and \(\lambda > 0\) is a regularization parameter introduced to avoid overfitting (in the second line, the factor \(1/n\) from the empirical covariance operators has been absorbed into \(\lambda\)).

\[\begin{align*} \hat U_{Y\mid X=x} &= \hat U_{Y\mid X}\, k(x,\cdot) = \Psi (K+\lambda I)^{-1} \Phi^\top k(x,\cdot)\\ &= \Psi (K+\lambda I)^{-1} K_{:,x} \\ \end{align*}\]

where \(K_{:,x} = (k(x,x_1), \dots, k(x,x_n))^\top\).
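
In practice the embedding is \(\hat U_{Y\mid X=x} = \Psi\beta\) with weights \(\beta = (K+\lambda I)^{-1}K_{:,x}\), so \(\mathbb{E}[g(Y)\mid X=x]\approx \sum_i \beta_i\, g(y_i)\). A minimal sketch on synthetic data (the bandwidth and \(\lambda\) are arbitrary illustrative choices), using \(g(y)=y\) (which may not lie in \(\mathcal{G}\), but illustrates the estimator) to approximate \(\mathbb{E}[Y\mid X=x]\):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(7)
n = 300
X = rng.uniform(-3, 3, size=(n, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=(n, 1))    # toy conditional relationship

sigma, lam = 0.5, 1e-3                            # illustrative hyperparameters
K = gaussian_kernel(X, X, sigma)

x_query = np.array([[1.0]])
K_x = gaussian_kernel(X, x_query, sigma)[:, 0]    # K_{:,x} = (k(x, x_1), ..., k(x, x_n))

# CME weights: beta = (K + lambda I)^{-1} K_{:,x}
beta = np.linalg.solve(K + lam * np.eye(n), K_x)

# E[g(Y) | X = x] ~ sum_i beta_i g(y_i); with g(y) = y this approximates E[Y | X = x]
print("estimate:", (beta @ Y[:, 0]).item(), "truth:", np.sin(x_query).item())
```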

Regression Interpretation: The operator \(U_{Y\mid X} = C_{YX} C_{XX}^{-1}\) can also be understood through regression: its regularized empirical estimate is the solution to a vector-valued ridge regression problem,

\[\hat{U}_{Y\mid X} = \arg\min_{U} \sum_{i=1}^n \|\phi(y_i) - U\phi(x_i)\|^2_{\mathcal{G}} + \lambda \|U\|^2_{HS}\]
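
With explicit finite-dimensional feature maps this equivalence can be checked directly: the ridge solution in closed form is \(\Psi\Phi^\top(\Phi\Phi^\top+\lambda I)^{-1}\), which equals \(\Psi(K+\lambda I)^{-1}\Phi^\top\) by the push-through identity. A sketch with random features (dimensions and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
n, d_x, d_y = 50, 4, 3            # samples, feature dims of phi(x) and psi(y)
Phi = rng.normal(size=(d_x, n))   # columns phi(x_i)
Psi = rng.normal(size=(d_y, n))   # columns psi(y_i)
lam = 0.1

# Ridge solution in primal form: U = Psi Phi^T (Phi Phi^T + lam I)^{-1}
U_primal = Psi @ Phi.T @ np.linalg.inv(Phi @ Phi.T + lam * np.eye(d_x))

# Kernelized form: U = Psi (K + lam I)^{-1} Phi^T with K = Phi^T Phi
K = Phi.T @ Phi
U_dual = Psi @ np.linalg.inv(K + lam * np.eye(n)) @ Phi.T

print(np.allclose(U_primal, U_dual))    # True: both minimize the ridge objective
```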
