A note on kernel methods
Kernel methods are one of the most elegant bridges between linear algebra, functional analysis, and modern machine learning. Once you see the pattern—map into a Hilbert space, do linear operations there—everything from SVMs to distribution comparison becomes a special case.
1. Preliminaries: The Geometry of Functions
To understand kernels, we first need to define the spaces in which they live. We transition from finite-dimensional vectors to infinite-dimensional function spaces.
1.1 Function Spaces
A function space \(\mathcal{F}\) is a vector space where the “points” are functions.
Definition. Let \(f, f' \in \mathcal{F}\) correspond to weight vectors \(w, w'\). The operations are defined point-wise:
- Addition/Scaling: \((\alpha f + f')(x) = \alpha f(x) + f'(x)\).
- Inner Product: \(\langle f, f' \rangle = w \cdot w'\).
- Norm: \(\|f\|_{\mathcal{F}} = \sqrt{\langle f, f \rangle} = \|w\|\).
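To make the weight-vector picture concrete, here is a minimal numpy sketch assuming an explicit finite-dimensional feature map \(\phi(x) = (1, x, x^2)\); the feature map, weights, and test point are illustrative choices, not anything canonical.

```python
# Weight-vector view of a function space: f(x) = <w, phi(x)> with an
# explicit (illustrative) finite-dimensional feature map.
import numpy as np

def phi(x):
    return np.array([1.0, x, x**2])  # explicit feature map

w, w_prime = np.array([0.5, -1.0, 2.0]), np.array([1.0, 0.0, 3.0])
f = lambda x: w @ phi(x)             # f corresponds to weight vector w
f_prime = lambda x: w_prime @ phi(x)

alpha, x0 = 2.0, 0.7
# point-wise addition/scaling agrees with operating on the weight vectors
lhs = alpha * f(x0) + f_prime(x0)
rhs = (alpha * w + w_prime) @ phi(x0)
print(np.isclose(lhs, rhs))          # True

# inner product and norm are inherited from the weights
print(w @ w_prime)                   # <f, f'>
print(np.linalg.norm(w))             # ||f||_F
```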
1.2 The \(L^2(X, \mu)\) Space
In probability and measure theory, we often care about functions whose “size” is finite under an integral.
Definition. \(L^2(X, \mu)\) consists of all measurable functions \(f\) such that:
\[\|f\|_2 = \left( \int_X |f(x)|^2 \, d\mu(x) \right)^{1/2} < \infty\]
1.3 Hilbert Space
A Hilbert space is essentially a vector space equipped with an inner product that is complete (meaning all Cauchy sequences converge within the space).
Cauchy Sequence: For every \(\epsilon > 0\), there exists \(N\) such that \(\|f_n - f_m\|_{\mathcal{F}} < \epsilon\) for all \(m, n \ge N\).
Completeness: Ensures we can perform calculus and optimization without “falling out” of the space.
2. Kernels and the RKHS
The “Kernel Trick” allows us to compute inner products in high-dimensional feature spaces without ever explicitly computing the coordinates of the data.
2.1 What is a Kernel?
Definition. A function \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is a positive definite kernel if there exists a Hilbert space \(\mathcal{H}\) and a feature map \(\phi: \mathcal{X} \to \mathcal{H}\) such that:
\[k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}\]
The Mercer Condition: For \(k\) to be a valid kernel, the kernel matrix \(K\) (where \(K_{ij} = k(x_i, x_j)\)) must be Symmetric Positive Semi-Definite (SPSD) for every finite collection of points \(x_1, \dots, x_m \in \mathcal{X}\).
This means: \(\forall \alpha \in \mathbb{R}^m, \alpha^\top K \alpha \ge 0\).
Equivalently, all eigenvalues of \(K\) are non-negative.
In machine learning, commonly used kernels include the Gaussian and Laplace kernels, i.e.,
\[k(x,x')=\exp\left(-\frac{\| x-x'\|^2_2}{2\sigma^2}\right), \qquad k(x,x')=\exp\left(-\frac{\| x-x'\|_1}{\sigma}\right)\]
where \(\sigma > 0\) is a bandwidth parameter. These kernels belong to the class of radial basis function (RBF) kernels.
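As a quick sanity check of the SPSD condition, the sketch below builds a Gaussian-kernel Gram matrix on random data and confirms that its eigenvalues are (numerically) non-negative; the sample and bandwidth are arbitrary illustrative choices.

```python
# Build a Gaussian kernel Gram matrix and verify symmetry and
# non-negative eigenvalues (the SPSD condition).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sigma = 1.0

# pairwise squared Euclidean distances and the Gaussian Gram matrix
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma**2))

eigvals = np.linalg.eigvalsh(K)       # K is symmetric, so eigvalsh applies
print(np.allclose(K, K.T))            # True: symmetric
print(eigvals.min() >= -1e-10)        # True: eigenvalues numerically non-negative
```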
Universal and characteristic kernel (reference)
The notions of universal and characteristic kernels have developed in parallel in machine learning: universal kernels were proposed in the context of achieving the Bayes risk with kernel-based classification/regression algorithms, while characteristic kernels were introduced in the context of distinguishing probability measures by embedding them into a reproducing kernel Hilbert space (RKHS).
2.2 Reproducing Kernel Hilbert Space (RKHS)
The RKHS is a special Hilbert space where evaluation is a bounded linear functional.
Evaluation Functional:
- For each \(x \in \mathcal{X}\), the evaluation functional \(L_x\) is the map that takes a function \(f \in \mathcal{H}\) and returns its value at \(x\): \(L_x(f) = f(x)\).
- The functional \(L_x\) is linear, meaning: \(L_x(f+g) = L_x(f) + L_x(g)\) and \(L_x(\alpha f) = \alpha L_x(f)\).
Definition (Reproducing kernel Hilbert space, RKHS). A Hilbert space \(\mathcal{H}\) of functions is a reproducing kernel Hilbert space (RKHS) if the evaluation functionals \(F_x\), defined by \(F_x[f] = f(x)\), are bounded, i.e., for all \(x \in \mathcal{X}\) there exists some \(C > 0\) such that
\[|F_x[f]| = |f(x)| \le C\|f\|_{\mathcal{H}}, \qquad \forall f \in \mathcal{H}\]
Interpretation: In many function spaces, two functions can be “close” in norm yet take very different values at a specific point \(x\). In an RKHS, if two functions are close in norm, their values \(f(x)\) are also close at every point \(x\).
The Reproducing Property: For every \(x \in X\), there exists a function \(k_x \in \mathcal{H}\) (the representer of evaluation) such that:
\[f(x) = \langle f, k_x \rangle_{\mathcal{H}}\]
If we set \(f = k_y\) (also written \(k(\cdot, y)\)), we recover the kernel itself:
\[k(x, y) = \langle k_x, k_y \rangle_{\mathcal{H}}\]
The RKHS \(\mathcal{H}\) is fully characterized by its reproducing kernel \(k\).
Theorem. For every positive definite function \(k(·, ·)\) on \(X × X\) there exists a unique (up to isometric isomorphism) RKHS with \(k\) as its reproducing kernel. Conversely, the reproducing kernel of an RKHS is unique and positive definite.
The kernel \(k\) generates the RKHS, i.e. \(\mathcal{H}=\overline{\text{span}}\{k(x,\cdot)\mid x\in \mathcal{X}\}\).
Define \(\phi(x) := k(x,\cdot)\), i.e., the function \(x' \mapsto k(x,x')\), for all \(x \in \mathcal{X}\). Then \(\mathcal{H}\) is the closure of the set of all linear combinations \(\sum_{i=1}^m \alpha_i \phi(x_i)\) with \(x_1, \dots, x_m \in \mathcal{X}\) and \(\alpha_1, \dots, \alpha_m \in \mathbb{R}\).
Also, we have
\[\left\langle \sum_{i=1}^m \alpha_i \phi(x_i) , \sum_{j=1}^m \beta_j \phi(x'_j)\right\rangle_\mathcal{H} = \sum_{i=1}^m \sum_{j=1}^m \alpha_i \beta_j k(x_i, x_j')\]
By virtue of the reproducing property, we know:
- \(k(x, \cdot)\) is a high-dimensional representer of \(x\)
- by the reproducing property it also acts as a representer of evaluation of any function in \(\mathcal{H}\) at the data point \(x\) (a small numerical sketch follows below)
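In practice this means any function built from finitely many kernel sections is handled entirely through the Gram matrix. A minimal sketch, using a Gaussian kernel with random centres and coefficients (all illustrative choices):

```python
# RKHS inner products of finite kernel expansions reduce to Gram-matrix
# computations; the reproducing property is checked at a test point.
import numpy as np

def k(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(1)
xs = rng.normal(size=(5, 2))          # centres of f = sum_i alpha_i k(x_i, .)
zs = rng.normal(size=(4, 2))          # centres of g = sum_j beta_j k(z_j, .)
alpha, beta = rng.normal(size=5), rng.normal(size=4)

K_xz = np.array([[k(x, z) for z in zs] for x in xs])
inner_fg = alpha @ K_xz @ beta        # <f, g>_H = alpha^T K beta
print(inner_fg)

# reproducing property: f(x0) = <f, k(x0, .)>_H
x0 = rng.normal(size=2)
f_x0 = sum(a * k(x, x0) for a, x in zip(alpha, xs))
inner_f_kx0 = alpha @ np.array([k(x, x0) for x in xs])
print(np.isclose(f_x0, inner_f_kx0))  # True
```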
Definition (Separable Hilbert Space). A Hilbert space \(\mathcal{H}\) is said to be separable if it has a countable orthonormal basis.
Let \(a\in \mathcal{H}\) and let \((e_i)_{i\in I}\) be an orthonormal basis for \(\mathcal{H}\), where the index set \(I\) is finite or countably infinite. Then
\[a = \sum_{i\in I} \langle a, e_i\rangle_{\mathcal{H}}\, e_i .\]
Most of the Hilbert spaces we deal with are separable. An example of a kernel whose RKHS is non-separable is \(k(x,y)=\mathbb{1}(x=y)\) on an uncountable domain such as \(\mathbb{R}\).
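A tiny numerical illustration of the expansion above, using a random orthonormal basis of \(\mathbb{R}^5\) as a finite-dimensional stand-in for a separable Hilbert space:

```python
# Check a = sum_i <a, e_i> e_i for an orthonormal basis (columns of Q).
import numpy as np

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # columns form an orthonormal basis
a = rng.normal(size=5)

coeffs = Q.T @ a                               # <a, e_i> for each basis vector
reconstruction = Q @ coeffs                    # sum_i <a, e_i> e_i
print(np.allclose(a, reconstruction))          # True
```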
2.2.1 Tensor Product (reference)
Definition (Tensor product of vectors).
If \(x, y\) are vectors of length \(M\) and \(N\), respectively, their tensor product \(x \otimes y\) is the \(M \times N\) matrix defined by \((x \otimes y)_{ij} = x_i y_j\). In other words, \(x \otimes y = xy^\top\).
Definition (Tensor product of matrices). If \(S : \mathbb{R}^M \to \mathbb{R}^M\) and \(T : \mathbb{R}^N \to \mathbb{R}^N\) are matrices, we define the linear mapping \(S \otimes T : L_{M,N}(\mathbb{R}) \to L_{M,N}(\mathbb{R})\), where \(L_{M,N}(\mathbb{R})\) is the space of \(M \times N\) real matrices, by linear extension of \((S \otimes T)(e_i \otimes e_j) = (Se_i) \otimes (Te_j)\). The linear mapping \(S \otimes T\) is called the tensor product of the matrices \(S\) and \(T\).
Tensor product
- \(\langle f, k_x \rangle_{\mathcal{H}}\, \langle g, l_y\rangle_{\mathcal{G}} = \langle f, (k_x \otimes l_y)\, g \rangle_{\mathcal{H}}\)
- \((a\otimes b)\, x = \langle x, b\rangle\, a\), \((ba^\top)f=(a^\top f)\,b\)
- \(\langle g\otimes f , \psi(y) \otimes \phi(x) \rangle_{\mathcal{G} \otimes \mathcal{H}} = \langle g, \psi(y)\rangle_{\mathcal{G}}\, \langle f, \phi(x)\rangle_{\mathcal{H}}\)
- \((S\otimes T)(x\otimes y) = (Sx)\otimes(Ty)\)
- \((S_1 \otimes T_1)(S_2 \otimes T_2) = (S_1 S_2) \otimes (T_1 T_2)\)
- \((S \otimes T)X = SXT^\top\) if \(S, T\) are matrices and \(X\) is an \(M \times N\) matrix (these identities are checked numerically in the sketch below)
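In finite dimensions these identities can be checked directly by identifying \(x \otimes y\) with the outer product \(xy^\top\); the sketch below does so with random vectors and matrices (all illustrative choices):

```python
# Tensor-product identities realised with outer products of vectors.
import numpy as np

rng = np.random.default_rng(3)
M, N = 4, 3
x, y = rng.normal(size=M), rng.normal(size=N)
a, b = rng.normal(size=M), rng.normal(size=N)
S, T = rng.normal(size=(M, M)), rng.normal(size=(N, N))

outer = np.outer(x, y)                                 # (x ⊗ y)_{ij} = x_i y_j
print(np.allclose(outer, x[:, None] * y[None, :]))     # True

# rank-one action: (a ⊗ b) v = <v, b> a
v = rng.normal(size=N)
print(np.allclose(np.outer(a, b) @ v, (v @ b) * a))    # True

# (S ⊗ T)(x ⊗ y) = (S x) ⊗ (T y), realised as S (x y^T) T^T
print(np.allclose(S @ outer @ T.T, np.outer(S @ x, T @ y)))  # True

# composition: (S1 ⊗ T1)(S2 ⊗ T2) = (S1 S2) ⊗ (T1 T2), checked on a matrix X
S2, T2 = rng.normal(size=(M, M)), rng.normal(size=(N, N))
X = rng.normal(size=(M, N))
lhs = S @ (S2 @ X @ T2.T) @ T.T
rhs = (S @ S2) @ X @ (T @ T2).T
print(np.allclose(lhs, rhs))                           # True
```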
2.2.2 Hilbert-Schmidt Operator (reference)
Hilbert-Schmidt operator. Let \(\mathcal{F}\) and \(\mathcal{G}\) be separable Hilbert spaces and let \((f_i)_{i\in I}\) and \((g_j)_{j\in J}\) be orthonormal bases for \(\mathcal{F}\) and \(\mathcal{G}\), respectively. A Hilbert-Schmidt operator is a bounded operator \(\mathcal{A}: \mathcal{G}\to\mathcal{F}\) whose Hilbert-Schmidt norm
\[\|\mathcal{A}\|_{HS}^2 = \sum_{j\in J} \|\mathcal{A}g_j\|_{\mathcal{F}}^2=\sum_{i\in I}\sum_{j\in J} |\langle \mathcal{A} g_j, f_i\rangle_\mathcal{F}|^2\]is finite (\(\|\mathcal{A}\|_{HS}^2 < \infty\)).
The Hilbert-Schmidt operators mapping from \(\mathcal{G}\) to \(\mathcal{F}\) form a Hilbert space, written \(HS(\mathcal{G},\mathcal{F})\), with inner product
\[\langle L, M\rangle_{HS} = \sum_{j\in J} \langle L g_j, M g_j\rangle_\mathcal{F}=\sum_{i\in I}\sum_{j \in J} \langle L g_j, f_i\rangle_\mathcal{F}\,\langle M g_j, f_i\rangle_\mathcal{F}\]
Given \(b \in \mathcal{G}\) and \(a \in \mathcal{F}\), we define the tensor product \(a\otimes b\) as a rank-one operator from \(\mathcal{G}\) to \(\mathcal{F}\),
\[(a\otimes b)\, g = \langle g,b \rangle_\mathcal{G}\, a\]
Its Hilbert-Schmidt norm is
\[\begin{align*} \|a \otimes b\|_{HS}^2 &= \sum_{j\in J} \|(a \otimes b) g_j \|_{\mathcal{F}}^2\\ &= \sum_{j\in J} \|\langle g_j ,b \rangle_\mathcal{G}\, a \|_{\mathcal{F}}^2\\ &= \sum_{j\in J} |\langle g_j ,b \rangle_\mathcal{G}|^2 \, \| a \|_\mathcal{F}^2\\ &= \|b\|_\mathcal{G}^2 \|a\|_{\mathcal{F}}^2 \end{align*}\]
Definition (bounded operator). A linear operator \(A : \mathcal{F} \to \mathbb{R}\) is bounded when there exists a constant \(C > 0\) such that \(|Af| \le C \|f\|_{\mathcal{F}}\) for all \(f \in \mathcal{F}\).
Properties.
- \(\langle L, a\otimes b \rangle_{HS}=\langle a, Lb \rangle_\mathcal{F}\), given \(a\otimes b, L ∈ HS(\mathcal{G}, \mathcal{F})\),
- \(\langle u\otimes v, a\otimes b \rangle_{HS}=\langle u, a \rangle_\mathcal{F} \langle b, v \rangle_\mathcal{G}\),
- \(\|\phi(x) \otimes \psi(y)\|_{HS} = \|\phi(x)\|_{\mathcal{F}} \|\psi(y)\|_{\mathcal{G}}=\sqrt{k(x,x) l(y,y)}\),
- \(\langle C_{XY}, f\otimes g\rangle_{HS} = \mathbb{E}_{xy}\langle \phi(x)\otimes \psi(y), f\otimes g\rangle_{HS} = \mathbb{E}_{xy} [\langle f, \phi(x)\rangle_{\mathcal{F}}\, \langle g, \psi(y)\rangle_{\mathcal{G}}]=\mathbb{E}_{xy}[f(x)g(y)]\),
- \(\langle C_{XY}, C_{XY}\rangle_{HS} = E_{x,y}E_{x',y'}k(x,x')l(y,y')\),
- \(\langle \mu_{X} \otimes \mu_{Y}, \mu_{X} \otimes \mu_{Y} \rangle_{HS} = E_{x,x'}E_{y,y'}k(x,x')l(y,y')\),
- In vector space, \(\langle C_{XY}, A\rangle_{HS} = trace(C_{XY}^\top A)=\sum_j (C_{XY}g_j)^\top (Ag_j)\).
- \(\|a \otimes b\|_{HS} = \|a\|_{\mathcal{F}}\, \|b\|_{\mathcal{G}}\) (verified numerically in the sketch below).
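In finite dimensions the Hilbert-Schmidt norm and inner product reduce to the Frobenius ones, so the rank-one identities above can be verified with outer products. A minimal sketch with random vectors (purely illustrative):

```python
# Hilbert-Schmidt (Frobenius) identities for rank-one operators a ⊗ b = a b^T.
import numpy as np

rng = np.random.default_rng(4)
a, u = rng.normal(size=6), rng.normal(size=6)   # elements of F
b, v = rng.normal(size=4), rng.normal(size=4)   # elements of G

A, U = np.outer(a, b), np.outer(u, v)           # rank-one operators G -> F

# ||a ⊗ b||_HS = ||a|| ||b||
print(np.isclose(np.linalg.norm(A, "fro"),
                 np.linalg.norm(a) * np.linalg.norm(b)))   # True

# <u ⊗ v, a ⊗ b>_HS = <u, a> <v, b>
print(np.isclose(np.sum(U * A), (u @ a) * (v @ b)))         # True

# <L, a ⊗ b>_HS = <a, L b> for a general operator L: G -> F
L = rng.normal(size=(6, 4))
print(np.isclose(np.sum(L * A), a @ (L @ b)))               # True
```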
3. Kernel Mean Embeddings
Kernel mean embeddings allow us to represent entire probability distributions as a single point (a function) in an RKHS.
Definition. The mean embedding of a distribution \(\mathbb{P}\) is:
\[\mu_{\mathbb{P}} := \mathbb{E}_{X \sim \mathbb{P}} [k(X, \cdot)] = \int_{\mathcal{X}} \phi(x)\, d\mathbb{P}(x)\]
Why is this useful?
- If the kernel is “characteristic” (like the Gaussian or Laplace kernel), the mapping \(\mathbb{P} \to \mu_{\mathbb{P}}\) is injective, so \(\mu_{\mathbb{P}}\) contains all information about the distribution. In particular, \(\|\mu_\mathbb{P}-\mu_\mathbb{Q}\|_{\mathcal{H}} = 0\) if and only if \(\mathbb{P}=\mathbb{Q}\), which is the basis for comparing distributions in the RKHS (see the sketch after this list).
- If \(\mathbb{P}\) and \(\mathbb{Q}\) are close in common probability metrics, then \(\mu_\mathbb{P}\) is also close to \(\mu_\mathbb{Q}\) in the \(\|\cdot \|_{\mathcal{H}}\) norm.
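The squared RKHS distance between two mean embeddings expands into kernel evaluations only, \(\|\mu_\mathbb{P}-\mu_\mathbb{Q}\|^2_\mathcal{H} = \mathbb{E}\,k(X,X') + \mathbb{E}\,k(Y,Y') - 2\,\mathbb{E}\,k(X,Y)\), which is the quantity known as the (squared) maximum mean discrepancy. Below is a minimal sketch of the plug-in (biased) estimator; the Gaussian kernel, bandwidth, and samples are illustrative choices.

```python
# Plug-in estimate of ||mu_P - mu_Q||_H^2 from two samples, using only
# kernel evaluations (biased V-statistic form).
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d / (2 * sigma**2))

rng = np.random.default_rng(5)
X = rng.normal(loc=0.0, size=(200, 2))   # sample from P
Y = rng.normal(loc=1.0, size=(200, 2))   # sample from Q (shifted mean)

mmd_sq = (gauss_gram(X, X).mean()
          + gauss_gram(Y, Y).mean()
          - 2 * gauss_gram(X, Y).mean())
print(mmd_sq)   # noticeably > 0 here; shrinks towards 0 when P and Q coincide
```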
**Expectations as Inner Products:** To calculate the expected value of a function \(f\), we simply take an inner product:
\[\mathbb{E}_{\mathbb{P}}[f(X)] = \langle f, \mu_{\mathbb{P}} \rangle_{\mathcal{H}}\]
Lemma (Existence of the mean embedding). If \(\mathbb{E}_{X\sim\mathbb{P}}[\sqrt{k(X, X)}] < \infty\), then \(\mu_\mathbb{P}\in\mathcal{H}\) and \(\mathbb{E}_{X\sim\mathbb{P}}[f(X)]=\langle f,\mu_{\mathbb{P}}\rangle_\mathcal{H}\) for all \(f \in \mathcal{H}\).
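With the empirical embedding \(\hat\mu_{\mathbb{P}} = \frac{1}{n}\sum_j k(x_j,\cdot)\), the inner-product form of the expectation becomes a plain average of kernel evaluations. A minimal sketch, where \(f\) is a finite kernel expansion with illustrative centres and coefficients:

```python
# E_P[f] = <f, mu_P>_H, checked with the empirical mean embedding.
import numpy as np

def k(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 2))              # sample defining the empirical embedding
zs = rng.normal(size=(4, 2))               # centres of f = sum_i alpha_i k(z_i, .)
alpha = rng.normal(size=4)

G = np.array([[k(z, x) for x in X] for z in zs])   # G[i, j] = k(z_i, x_j)

# <f, mu_hat_P>_H: pair f's coefficients with the averaged kernel sections
inner_product = alpha @ G.mean(axis=1)

# direct Monte-Carlo estimate of E_P[f(X)]
monte_carlo = np.mean(alpha @ G)
print(np.isclose(inner_product, monte_carlo))      # True, by the reproducing property
```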
4. Operators: Covariance and Conditional Mean
Just as we have covariance matrices for vectors, we have operators for functions.
4.1 Cross-Covariance Operator
Definition. The uncentered cross-covariance operator \(C_{YX}: \mathcal{H} \to \mathcal{G}\) is:
\[C_{YX} := \mathbb{E}_{YX} [\psi(Y) \otimes \phi(X)]\]
It satisfies the property:
\(\langle g, C_{YX} f \rangle_{\mathcal{G}} = \mathbb{E}_{YX}[g(Y) f(X)]\) for all \(f \in \mathcal{H}\) and \(g \in \mathcal{G}\).
The centered cross-covariance operator is
\[\tilde C_{YX} :=\mathbb{E}_{YX}[\psi(Y)\otimes\phi(X)] - \mu_{P_Y} \otimes \mu_{P_X}=\mu_{P_{YX}} - \mu_{P_Y \otimes P_X}.\]
Equivalently, we can define \(\tilde C_{YX}\) as the unique bounded operator that satisfies
\[\langle g, \tilde C_{YX}f\rangle_{\mathcal{G}} = \text{Cov}[g(Y),f(X)]\]
for all \(f \in \mathcal{H}\) and \(g \in \mathcal{G}\).
Covariance operator. If \(X = Y\) , we call \(C_{XX}\) the covariance operator.
- \((C_{YX}f)(\cdot)=\int_{\mathcal{X}\times \mathcal{Y}}l(\cdot,y)\,f(x)\, dP_{XY}(x,y)\).
- \((C_{XX}f)(\cdot)=\int_{\mathcal{X}}k(\cdot,x)\,f(x)\, dP_{X}(x)\).
Theorem. If \(\mathbb{E}_{Y\mid X}[g(Y)\mid X=\cdot]\in \mathcal{H}\) for all \(g\in\mathcal{G}\), then
\[C_{XX}\,\mathbb{E}_{Y\mid X}[g(Y)\mid X =\cdot]=C_{XY}\, g\]
Proof of the theorem. Using the rank-one action \((\phi(x)\otimes\phi(x))\,h = h(x)\,\phi(x)\) and the tower property,
\[\begin{align*} C_{XX}\,\mathbb{E}_{Y\mid X}[g(Y)\mid X =\cdot] &= \mathbb{E}_{X}[\phi(X)\otimes\phi(X)]\, \mathbb{E}_{Y\mid X}[g(Y)\mid X =\cdot]\\ &= \int_{\mathcal{X}} \phi(x)\, \mathbb{E}_{Y\mid X}[g(Y)\mid X =x]\, dP_X(x) \\ &= \int_{\mathcal{X}}\int_{\mathcal{Y}} \phi(x)\, \langle g , \psi(y)\rangle_{\mathcal{G}}\, dP_{Y\mid X}(y\mid x)\, dP_{X}(x) \\ &= \int_{\mathcal{X}\times\mathcal{Y}} \bigl(\phi(x)\otimes\psi(y)\bigr)\, g \; dP_{XY}(x,y) \\ &= C_{XY}\, g \end{align*}\]
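Because the theorem is an operator identity, it can be sanity-checked exactly in a finite setting. The sketch below uses discrete \(X\) and \(Y\) with one-hot feature maps (so the RKHS is finite-dimensional and every function of \(X\) lies in \(\mathcal{H}\)); the joint distribution and \(g\) are arbitrary illustrative choices.

```python
# Finite check of C_XX E[g(Y)|X=.] = C_XY g with one-hot feature maps:
# C_XX = diag(p_X), C_XY = joint probability matrix, and the conditional
# expectation is an ordinary vector indexed by the states of X.
import numpy as np

rng = np.random.default_rng(7)
P = rng.random(size=(3, 4))          # joint p(x, y) over 3 x-states, 4 y-states
P /= P.sum()

p_x = P.sum(axis=1)                  # marginal of X
C_XX = np.diag(p_x)                  # E[phi(X) phi(X)^T] with one-hot phi
C_XY = P                             # E[phi(X) psi(Y)^T]

g = rng.normal(size=4)               # an arbitrary function of Y
h = (P @ g) / p_x                    # h(x) = E[g(Y) | X = x]

print(np.allclose(C_XX @ h, C_XY @ g))   # True
```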
The empirical estimate of the centered \(\tilde C_{YX}\) is:
\[\begin{align*} \hat C_{YX}&= \frac{1}{n} \sum_{i=1}^{n} l(y_i,\cdot)\otimes k(x_i,\cdot) - \hat \mu_{P_Y} \otimes \hat \mu_{P_X} \\ &=\frac{1}{n} \sum_{i=1}^{n} \{l(y_i,\cdot)-\hat \mu_{P_Y} \} \otimes \{k(x_i,\cdot)-\hat\mu_{P_X} \} \\ &= \frac1n (\Psi - \hat\mu_{P_Y} \mathbf{1}^\top)(\Phi - \hat\mu_{P_X} \mathbf{1}^\top)^\top \\ &= \frac1n \left(\Psi (I-\tfrac1n \mathbf{1} \mathbf{1}^\top)\right)\left(\Phi (I-\tfrac1n \mathbf{1} \mathbf{1}^\top)\right)^\top \\ &= \frac1n \Psi (I-\tfrac1n \mathbf{1} \mathbf{1}^\top) \Phi^\top \\ &=\frac1n \Psi H \Phi^\top \end{align*}\]
where \(H = I_n - \frac1n \mathbf{1}\mathbf{1}^\top\) is the centering matrix, \(\mathbf{1}\) is the \(n\)-vector of ones, \(\Psi = (\psi(y_1),\dots,\psi(y_n))\), and \(\Phi = (\phi(x_1),\dots,\phi(x_n))\); the last step uses \(H = H^\top\) and \(H^2 = H\).
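The last identity, \(\hat C_{YX} = \frac1n \Psi H \Phi^\top\), is pure matrix algebra, so it can be checked with any explicit finite-dimensional feature map. A minimal sketch, assuming simple polynomial features as stand-ins for \(\psi\) and \(\phi\):

```python
# Verify the centering-matrix form of the empirical centered cross-covariance.
import numpy as np

rng = np.random.default_rng(8)
n = 100
x, y = rng.normal(size=n), rng.normal(size=n)

Phi = np.stack([np.ones(n), x, x**2])     # columns phi(x_i), shape (3, n)
Psi = np.stack([np.ones(n), y, y**2])     # columns psi(y_i), shape (3, n)

H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
C_hat = Psi @ H @ Phi.T / n               # (1/n) Psi H Phi^T

mu_y = Psi.mean(axis=1, keepdims=True)    # empirical mean embeddings
mu_x = Phi.mean(axis=1, keepdims=True)
C_direct = (Psi - mu_y) @ (Phi - mu_x).T / n
print(np.allclose(C_hat, C_direct))       # True
```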
4.2 Conditional Mean Embedding (CME)
The CME represents the conditional distribution \(P(Y\mid X=x)\) as an element in \(\mathcal{G}\).
Let \(\mathcal{X}\) and \(\mathcal{Y}\) be measurable spaces and let \(X\) and \(Y\) be random variables taking values in \(\mathcal{X}\) and \(\mathcal{Y}\), respectively. Assume \(k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) and \(l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}\) are positive definite kernels with corresponding RKHSs \(\mathcal{H}\) and \(\mathcal{G}\). Let \(U_{Y\mid X} : \mathcal{H} \to \mathcal{G}\) and \(\mu_{Y\mid x} \in \mathcal{G}\) denote the conditional mean embeddings of the conditional distributions \(P(Y \mid X)\) and \(P(Y \mid X = x)\), respectively, so that
\[\mu_{Y\mid x} = \mathbb{E}_{Y\mid X}[\psi(Y)\mid X=x] = U_{Y\mid X}\, k(x,\cdot)\] \[\mathbb{E}_{Y\mid X}[g(Y)\mid X=x] = \langle g, \mu_{Y\mid x} \rangle_\mathcal{G}, \qquad \forall g \in \mathcal{G}\]
Definition. Let \(C_{XX} : \mathcal{H} \to \mathcal{H}\) and \(C_{YX} : \mathcal{H} \to \mathcal{G}\) be the covariance operator of \(X\) and the cross-covariance operator from \(X\) to \(Y\), respectively. Then the conditional mean embeddings \(U_{Y\mid X}\) and \(\mu_{Y\mid x}\) are defined as
\[U_{Y\mid X}:= C_{YX} C^{-1}_{XX}\] \[\mu_{Y\mid x}:=C_{YX}C^{-1}_{XX}k(x,\cdot)\]
Property:
- \(\mathbb{E}_{Y\mid X}[g(Y)\mid X=x]=\langle g, \mu_{Y\mid x}\rangle_{\mathcal{G}}=\langle \mathbb{E}_{Y\mid X}[g(Y)\mid X=\cdot], k(x,\cdot)\rangle_{\mathcal{H}}\).
Empirical estimate:
\[\begin{align*} \hat U_{Y\mid X} &= \hat C_{YX} (\hat C_{XX} + \lambda I)^{-1}\\ &= \Psi (K+\lambda I)^{-1} \Phi^\top \end{align*}\]
where \(\Psi:= (\psi(y_1), \dots, \psi(y_n))\), \(\Phi:= (\phi(x_1), \dots, \phi(x_n))\), \(K=\Phi^\top \Phi\) is the Gram matrix of the samples of \(X\), and \(\lambda\) is the regularization parameter used to avoid overfitting (the factor of \(n\) arising from the empirical operators is absorbed into \(\lambda\)). Consequently,
\[\begin{align*} \hat \mu_{Y\mid x} &= \Psi (K+\lambda I)^{-1} \Phi^\top k(x,\cdot)\\ &= \Psi (K+\lambda I)^{-1} K_{:,x} \end{align*}\]
where \(K_{:,x} = (k(x,x_1), \dots, k(x,x_n))^\top\).
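Operationally, \(\hat\mu_{Y\mid x}\) is a weighted sum of the training features \(\psi(y_i)\) with weights \(w = (K+\lambda I)^{-1} K_{:,x}\), so conditional expectations are estimated as \(\mathbb{E}[g(Y)\mid X=x] \approx \sum_i w_i\, g(y_i)\). A minimal sketch with illustrative data, bandwidth, and \(\lambda\); here \(g(y)=y\) is chosen only for readability, even though it need not lie exactly in \(\mathcal{G}\):

```python
# Empirical conditional mean embedding: conditional expectations become
# weighted sums over the training sample with weights (K + lam I)^{-1} K_{:,x}.
import numpy as np

def gauss_gram(A, B, sigma=0.5):
    d = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d / (2 * sigma**2))

rng = np.random.default_rng(9)
n, lam = 300, 1e-2
x_train = rng.uniform(-3, 3, size=n)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=n)

K = gauss_gram(x_train, x_train)
x0 = np.array([1.0])
w = np.linalg.solve(K + lam * np.eye(n), gauss_gram(x_train, x0)).ravel()

# estimate E[Y | X = 1.0]; roughly sin(1.0) ~ 0.84 for this synthetic data
print(w @ y_train)
```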
Regression Interpretation: The operator \(U_{Y\mid X} = C_{YX} C_{XX}^{-1}\) is actually the solution to a Vector-Valued Ridge Regression:
\[\hat{U}_{Y\mid X} = \arg\min_{U} \sum_{i=1}^n \|\psi(y_i) - U\phi(x_i)\|^2_{\mathcal{G}} + \lambda \|U\|^2_{HS}\]
References
- Sriperumbudur et al. (2011), On the Universality and Characteristic Kernels
- Gretton et al., Notes on Mean Embeddings and Covariance Operators
- Muandet et al. (2017), Kernel Mean Embedding of Distributions: A Review and Beyond
- Song et al. (2013), Kernel Embeddings of Conditional Distributions
- Song et al. (2009), Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems