A note on kernel methods
Kernel methods are one of the most elegant bridges between linear algebra, functional analysis, and modern machine learning. Once you see the pattern—map into a Hilbert space, do linear operations there—everything from SVMs to distribution comparison becomes a special case.
1. Preliminaries: The Geometry of Functions
To understand kernels, we first need to define the spaces in which they live. We transition from finite-dimensional vectors to infinite-dimensional function spaces.
1.1 Function Spaces
A function space \(\mathcal{F}\) is a vector space where the “points” are functions.
Definition. Let \(f, f' \in \mathcal{F}\) correspond to weight vectors \(w, w'\). The operations are defined point-wise:
- Addition/Scaling: \((\alpha f + f')(x) = \alpha f(x) + f'(x)\).
- Inner Product: \(\langle f, f' \rangle = w \cdot w'\).
- Norm: \(\|f\|_{\mathcal{F}} = \sqrt{\langle f, f \rangle} = \|w\|\).
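To make the weight-vector picture concrete, here is a minimal numpy sketch assuming an explicit finite-dimensional feature map \(\phi(x) = (1, x, x^2)\); the feature map, weights, and test point are illustrative choices, not anything canonical.

```python
# Weight-vector view of a function space: f(x) = <w, phi(x)> with an
# explicit (illustrative) finite-dimensional feature map.
import numpy as np

def phi(x):
    return np.array([1.0, x, x**2])  # explicit feature map

w, w_prime = np.array([0.5, -1.0, 2.0]), np.array([1.0, 0.0, 3.0])
f = lambda x: w @ phi(x)             # f corresponds to weight vector w
f_prime = lambda x: w_prime @ phi(x)

alpha, x0 = 2.0, 0.7
# point-wise addition/scaling agrees with operating on the weight vectors
lhs = alpha * f(x0) + f_prime(x0)
rhs = (alpha * w + w_prime) @ phi(x0)
print(np.isclose(lhs, rhs))          # True

# inner product and norm are inherited from the weights
print(w @ w_prime)                   # <f, f'>
print(np.linalg.norm(w))             # ||f||_F
```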
1.2 The \(L^2(X, \mu)\) Space
In probability and measure theory, we often care about functions whose “size” is finite under an integral.
Definition. \(L^2(X, \mu)\) consists of all measurable functions \(f\) such that:
\[\|f\|_2 = \left( \int_X |f(x)|^2 \, d\mu(x) \right)^{1/2} < \infty\]
1.3 Hilbert Space
A Hilbert space is essentially a vector space equipped with an inner product that is complete (meaning all Cauchy sequences converge within the space).
Cauchy Sequence: For every \(\epsilon > 0\), there exists \(N\) such that \(\|f_n - f_m\|_{\mathcal{F}} < \epsilon\) for all \(m, n \ge N\).
Completeness: Ensures we can perform calculus and optimization without “falling out” of the space.
2. Kernels and the RKHS
The “Kernel Trick” allows us to compute inner products in high-dimensional feature spaces without ever explicitly computing the coordinates of the data.
2.1 What is a Kernel?
Definition. A function \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is a positive definite kernel if there exists a Hilbert space \(\mathcal{H}\) and a feature map \(\phi: \mathcal{X} \to \mathcal{H}\) such that:
\[k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}\]
The Mercer Condition: For \(k\) to be a valid kernel, the kernel matrix \(K\) (where \(K_{ij} = k(x_i, x_j)\)) must be Symmetric Positive Semi-Definite (SPSD) for every finite collection of points \(x_1, \dots, x_m \in \mathcal{X}\).
This means: \(\forall \alpha \in \mathbb{R}^m, \alpha^\top K \alpha \ge 0\).
Equivalently, all eigenvalues of \(K\) are non-negative.
In machine learning, commonly used kernels include the Gaussian and Laplace kernels, i.e.,
\[k(x,x')=\exp\left(-\frac{\| x-x'\|^2_2}{2\sigma^2}\right), \qquad k(x,x')=\exp\left(-\frac{\| x-x'\|_1}{\sigma}\right)\]
where \(\sigma > 0\) is a bandwidth parameter. These kernels belong to the class of radial basis function (RBF) kernels.
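As a quick sanity check of the SPSD condition, the sketch below builds a Gaussian-kernel Gram matrix on random data and confirms that its eigenvalues are (numerically) non-negative; the sample and bandwidth are arbitrary illustrative choices.

```python
# Build a Gaussian kernel Gram matrix and verify symmetry and
# non-negative eigenvalues (the SPSD condition).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sigma = 1.0

# pairwise squared Euclidean distances and the Gaussian Gram matrix
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma**2))

eigvals = np.linalg.eigvalsh(K)       # K is symmetric, so eigvalsh applies
print(np.allclose(K, K.T))            # True: symmetric
print(eigvals.min() >= -1e-10)        # True: eigenvalues numerically non-negative
```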
Universal and characteristic kernel (reference)
The notions of universal and characteristic kernels have developed in parallel in machine learning: universal kernels were proposed in the context of achieving the Bayes risk with kernel-based classification/regression algorithms, while characteristic kernels were introduced in the context of distinguishing probability measures by embedding them into a reproducing kernel Hilbert space (RKHS).
2.2 Reproducing Kernel Hilbert Space (RKHS)
The RKHS is a special Hilbert space where evaluation is a bounded linear functional.
Evaluation Functional:
- For each \(x \in \mathcal{X}\), the evaluation functional \(L_x\) is the map that takes a function \(f \in \mathcal{H}\) and returns its value at \(x\): \(L_x(f) = f(x)\).
- The functional \(L_x\) is linear, meaning: \(L_x(f+g) = L_x(f) + L_x(g)\) and \(L_x(\alpha f) = \alpha L_x(f)\).
Definition (Reproducing kernel Hilbert space, RKHS). A Hilbert space \(\mathcal{H}\) of functions is a reproducing kernel Hilbert space (RKHS) if the evaluation functionals \(F_x\), defined by \(F_x[f] = f(x)\), are bounded, i.e., for all \(x \in \mathcal{X}\) there exists some \(C > 0\) such that
\[|F_x[f]| = |f(x)| \le C\|f\|_{\mathcal{H}}, \qquad \forall f \in \mathcal{H}\]
Interpretation: In many function spaces, two functions can be “close” in norm yet take very different values at a specific point \(x\). In an RKHS, if two functions are close in norm, their values \(f(x)\) are also close at every point \(x\).
The Reproducing Property: For every \(x \in X\), there exists a function \(k_x \in \mathcal{H}\) (the representer of evaluation) such that:
\[f(x) = \langle f, k_x \rangle_{\mathcal{H}}\]
If we set \(f = k_y\) (also written \(k(\cdot, y)\)), we recover the kernel itself:
\[k(x, y) = \langle k_x, k_y \rangle_{\mathcal{H}}\]
The RKHS \(\mathcal{H}\) is fully characterized by its reproducing kernel \(k\).
Theorem. For every positive definite function \(k(·, ·)\) on \(X × X\) there exists a unique (up to isometric isomorphism) RKHS with \(k\) as its reproducing kernel. Conversely, the reproducing kernel of an RKHS is unique and positive definite.
The kernel \(k\) generates the RKHS, i.e. \(\mathcal{H}=\overline{\text{span}}\{k(x,\cdot)\mid x\in \mathcal{X}\}\).
Define \(\phi(x) := k(x,\cdot)\), i.e., the function \(x' \mapsto k(x,x')\), for all \(x \in \mathcal{X}\). Then \(\mathcal{H}\) is the closure of the set of all linear combinations \(\sum_{i=1}^m \alpha_i \phi(x_i)\) with \(x_1, \dots, x_m \in \mathcal{X}\) and \(\alpha_1, \dots, \alpha_m \in \mathbb{R}\).
Also, we have
\[\left\langle \sum_{i=1}^m \alpha_i \phi(x_i) , \sum_{j=1}^m \beta_j \phi(x'_j)\right\rangle_\mathcal{H} = \sum_{i=1}^m \sum_{j=1}^m \alpha_i \beta_j k(x_i, x_j')\]
By virtue of the reproducing property, we know:
- \(k(x, \cdot)\) is a high-dimensional representer of \(x\)
- by the reproducing property it also acts as a representer of evaluation of any function in \(\mathcal{H}\) at the data point \(x\) (a small numerical sketch follows below)
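In practice this means any function built from finitely many kernel sections is handled entirely through the Gram matrix. A minimal sketch, using a Gaussian kernel with random centres and coefficients (all illustrative choices):

```python
# RKHS inner products of finite kernel expansions reduce to Gram-matrix
# computations; the reproducing property is checked at a test point.
import numpy as np

def k(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(1)
xs = rng.normal(size=(5, 2))          # centres of f = sum_i alpha_i k(x_i, .)
zs = rng.normal(size=(4, 2))          # centres of g = sum_j beta_j k(z_j, .)
alpha, beta = rng.normal(size=5), rng.normal(size=4)

K_xz = np.array([[k(x, z) for z in zs] for x in xs])
inner_fg = alpha @ K_xz @ beta        # <f, g>_H = alpha^T K beta
print(inner_fg)

# reproducing property: f(x0) = <f, k(x0, .)>_H
x0 = rng.normal(size=2)
f_x0 = sum(a * k(x, x0) for a, x in zip(alpha, xs))
inner_f_kx0 = alpha @ np.array([k(x, x0) for x in xs])
print(np.isclose(f_x0, inner_f_kx0))  # True
```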
Definition (Separable Hilbert Space). A Hilbert space \(\mathcal{H}\) is said to be separable if it has a countable orthonormal basis.
Let \(a\in \mathcal{H}\) and let \((e_i)_{i\in I}\) be an orthonormal basis for \(\mathcal{H}\), where the index set \(I\) is finite or countably infinite. Then
\[a = \sum_{i\in I} \langle a, e_i\rangle_{\mathcal{H}}\, e_i .\]
Most of the Hilbert spaces we deal with are separable. An example of a kernel whose RKHS is non-separable is \(k(x,y)=\mathbb{1}(x=y)\) on an uncountable domain such as \(\mathbb{R}\).
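A tiny numerical illustration of the expansion above, using a random orthonormal basis of \(\mathbb{R}^5\) as a finite-dimensional stand-in for a separable Hilbert space:

```python
# Check a = sum_i <a, e_i> e_i for an orthonormal basis (columns of Q).
import numpy as np

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # columns form an orthonormal basis
a = rng.normal(size=5)

coeffs = Q.T @ a                               # <a, e_i> for each basis vector
reconstruction = Q @ coeffs                    # sum_i <a, e_i> e_i
print(np.allclose(a, reconstruction))          # True
```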
2.2.1 Tensor Product (reference)
Definition (Tensor product of vectors).
If \(x, y\) are vectors of length \(M\) and \(N\), respectively, their tensor product \(x \otimes y\) is the \(M \times N\) matrix defined by \((x \otimes y)_{ij} = x_i y_j\). In other words, \(x \otimes y = xy^\top\).
Definition (Tensor product of matrices). If \(S : \mathbb{R}^M \to \mathbb{R}^M\) and \(T : \mathbb{R}^N \to \mathbb{R}^N\) are matrices, we define the linear mapping \(S \otimes T : L_{M,N}(\mathbb{R}) \to L_{M,N}(\mathbb{R})\), where \(L_{M,N}(\mathbb{R})\) is the space of \(M \times N\) real matrices, by linear extension of \((S \otimes T)(e_i \otimes e_j) = (Se_i) \otimes (Te_j)\). The linear mapping \(S \otimes T\) is called the tensor product of the matrices \(S\) and \(T\).
Tensor product
- \(\langle f, k_x \rangle_{\mathcal{H}}\, \langle g, l_y\rangle_{\mathcal{G}} = \langle f, (k_x \otimes l_y)\, g \rangle_{\mathcal{H}}\)
- \((a\otimes b)\, x = \langle x, b\rangle\, a\), \((ba^\top)f=(a^\top f)\,b\)
- \(\langle g\otimes f , \psi(y) \otimes \phi(x) \rangle_{\mathcal{G} \otimes \mathcal{H}} = \langle g, \psi(y)\rangle_{\mathcal{G}}\, \langle f, \phi(x)\rangle_{\mathcal{H}}\)
- \((S\otimes T)(x\otimes y) = (Sx)\otimes(Ty)\)
- \((S_1 \otimes T_1)(S_2 \otimes T_2) = (S_1 S_2) \otimes (T_1 T_2)\)
- \((S \otimes T)X = SXT^\top\) if \(S, T\) are matrices and \(X\) is an \(M \times N\) matrix (these identities are checked numerically in the sketch below)
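In finite dimensions these identities can be checked directly by identifying \(x \otimes y\) with the outer product \(xy^\top\); the sketch below does so with random vectors and matrices (all illustrative choices):

```python
# Tensor-product identities realised with outer products of vectors.
import numpy as np

rng = np.random.default_rng(3)
M, N = 4, 3
x, y = rng.normal(size=M), rng.normal(size=N)
a, b = rng.normal(size=M), rng.normal(size=N)
S, T = rng.normal(size=(M, M)), rng.normal(size=(N, N))

outer = np.outer(x, y)                                 # (x ⊗ y)_{ij} = x_i y_j
print(np.allclose(outer, x[:, None] * y[None, :]))     # True

# rank-one action: (a ⊗ b) v = <v, b> a
v = rng.normal(size=N)
print(np.allclose(np.outer(a, b) @ v, (v @ b) * a))    # True

# (S ⊗ T)(x ⊗ y) = (S x) ⊗ (T y), realised as S (x y^T) T^T
print(np.allclose(S @ outer @ T.T, np.outer(S @ x, T @ y)))  # True

# composition: (S1 ⊗ T1)(S2 ⊗ T2) = (S1 S2) ⊗ (T1 T2), checked on a matrix X
S2, T2 = rng.normal(size=(M, M)), rng.normal(size=(N, N))
X = rng.normal(size=(M, N))
lhs = S @ (S2 @ X @ T2.T) @ T.T
rhs = (S @ S2) @ X @ (T @ T2).T
print(np.allclose(lhs, rhs))                           # True
```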
2.2.2 Hilbert-Schmidt Operator (reference)
Hilbert-Schmidt operator. Let \(\mathcal{F}\) and \(\mathcal{G}\) be separable Hilbert spaces and let \((f_i)_{i\in I}\) and \((g_j)_{j\in J}\) be orthonormal bases for \(\mathcal{F}\) and \(\mathcal{G}\), respectively. A Hilbert-Schmidt operator is a bounded operator \(\mathcal{A}: \mathcal{G}\to\mathcal{F}\) whose Hilbert-Schmidt norm
\[\|\mathcal{A}\|_{HS}^2 = \sum_{j\in J} \|\mathcal{A}g_j\|_{\mathcal{F}}^2=\sum_{i\in I}\sum_{j\in J} |\langle \mathcal{A} g_j, f_i\rangle_\mathcal{F}|^2\]is finite (\(\|\mathcal{A}\|_{HS}^2 < \infty\)).
The Hilbert-Schmidt operators mapping from \(\mathcal{G}\) to \(\mathcal{F}\) form a Hilbert space, written \(HS(\mathcal{G},\mathcal{F})\), with inner product
\[\langle L, M\rangle_{HS} = \sum_{j\in J} \langle L g_j, M g_j\rangle_\mathcal{F}=\sum_{i\in I}\sum_{j \in J} \langle L g_j, f_i\rangle_\mathcal{F}\,\langle M g_j, f_i\rangle_\mathcal{F}\]
Given \(b \in \mathcal{G}\) and \(a \in \mathcal{F}\), we define the tensor product \(a\otimes b\) as a rank-one operator from \(\mathcal{G}\) to \(\mathcal{F}\),
\[(a\otimes b)\, g = \langle g,b \rangle_\mathcal{G}\, a\]
Its Hilbert-Schmidt norm is
\[\begin{align*} \|a \otimes b\|_{HS}^2 &= \sum_{j\in J} \|(a \otimes b) g_j \|_{\mathcal{F}}^2\\ &= \sum_{j\in J} \|\langle g_j ,b \rangle_\mathcal{G}\, a \|_{\mathcal{F}}^2\\ &= \sum_{j\in J} |\langle g_j ,b \rangle_\mathcal{G}|^2 \, \| a \|_\mathcal{F}^2\\ &= \|b\|_\mathcal{G}^2 \|a\|_{\mathcal{F}}^2 \end{align*}\]
Definition (bounded operator). A linear operator \(A : \mathcal{F} \to \mathbb{R}\) is bounded when there exists a constant \(C > 0\) such that \(|Af| \le C \|f\|_{\mathcal{F}}\) for all \(f \in \mathcal{F}\).
Properties.
- \(\langle L, a\otimes b \rangle_{HS}=\langle a, Lb \rangle_\mathcal{F}\), given \(a\otimes b, L ∈ HS(\mathcal{G}, \mathcal{F})\),
- \(\langle u\otimes v, a\otimes b \rangle_{HS}=\langle u, a \rangle_\mathcal{F} \langle b, v \rangle_\mathcal{G}\),
- \(\|\phi(x) \otimes \psi(y)\|_{HS} = \|\phi(x)\|_{\mathcal{F}} \|\psi(y)\|_{\mathcal{G}}=\sqrt{k(x,x) l(y,y)}\),
- \(\langle C_{XY}, f\otimes g\rangle_{HS} = \mathbb{E}_{xy}\langle \phi(x)\otimes \psi(y), f\otimes g\rangle_{HS} = \mathbb{E}_{xy} [\langle f, \phi(x)\rangle_{\mathcal{F}}\, \langle g, \psi(y)\rangle_{\mathcal{G}}]=\mathbb{E}_{xy}[f(x)g(y)]\),
- \(\langle C_{XY}, C_{XY}\rangle_{HS} = E_{x,y}E_{x',y'}k(x,x')l(y,y')\),
- \(\langle \mu_{X} \otimes \mu_{Y}, \mu_{X} \otimes \mu_{Y} \rangle_{HS} = E_{x,x'}E_{y,y'}k(x,x')l(y,y')\),
- In vector space, \(\langle C_{XY}, A\rangle_{HS} = trace(C_{XY}^\top A)=\sum_j (C_{XY}g_j)^\top (Ag_j)\).
- \(\|a \otimes b\|_{HS} = \|a\|_{\mathcal{F}}\, \|b\|_{\mathcal{G}}\) (verified numerically in the sketch below).
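In finite dimensions the Hilbert-Schmidt norm and inner product reduce to the Frobenius ones, so the rank-one identities above can be verified with outer products. A minimal sketch with random vectors (purely illustrative):

```python
# Hilbert-Schmidt (Frobenius) identities for rank-one operators a ⊗ b = a b^T.
import numpy as np

rng = np.random.default_rng(4)
a, u = rng.normal(size=6), rng.normal(size=6)   # elements of F
b, v = rng.normal(size=4), rng.normal(size=4)   # elements of G

A, U = np.outer(a, b), np.outer(u, v)           # rank-one operators G -> F

# ||a ⊗ b||_HS = ||a|| ||b||
print(np.isclose(np.linalg.norm(A, "fro"),
                 np.linalg.norm(a) * np.linalg.norm(b)))   # True

# <u ⊗ v, a ⊗ b>_HS = <u, a> <v, b>
print(np.isclose(np.sum(U * A), (u @ a) * (v @ b)))         # True

# <L, a ⊗ b>_HS = <a, L b> for a general operator L: G -> F
L = rng.normal(size=(6, 4))
print(np.isclose(np.sum(L * A), a @ (L @ b)))               # True
```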
3. Kernel Mean Embeddings
Kernel mean embeddings allow us to represent entire probability distributions as a single point (a function) in an RKHS.
Definition. The mean embedding of a distribution \(\mathbb{P}\) is:
\[\mu_{\mathbb{P}} := \mathbb{E}_{X \sim \mathbb{P}} [k(X, \cdot)] = \int_{\mathcal{X}} \phi(x)\, d\mathbb{P}(x)\]
Why is this useful?
- If the kernel is “characteristic” (like the Gaussian or Laplace kernel), the mapping \(\mathbb{P} \to \mu_{\mathbb{P}}\) is injective, so \(\mu_{\mathbb{P}}\) contains all information about the distribution. In particular, \(\|\mu_\mathbb{P}-\mu_\mathbb{Q}\|_{\mathcal{H}} = 0\) if and only if \(\mathbb{P}=\mathbb{Q}\), which is the basis for comparing distributions in the RKHS (see the sketch after this list).
- If \(\mathbb{P}\) and \(\mathbb{Q}\) are close in common probability metrics, then \(\mu_\mathbb{P}\) is also close to \(\mu_\mathbb{Q}\) in the \(\|\cdot \|_{\mathcal{H}}\) norm.
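The squared RKHS distance between two mean embeddings expands into kernel evaluations only, \(\|\mu_\mathbb{P}-\mu_\mathbb{Q}\|^2_\mathcal{H} = \mathbb{E}\,k(X,X') + \mathbb{E}\,k(Y,Y') - 2\,\mathbb{E}\,k(X,Y)\), which is the quantity known as the (squared) maximum mean discrepancy. Below is a minimal sketch of the plug-in (biased) estimator; the Gaussian kernel, bandwidth, and samples are illustrative choices.

```python
# Plug-in estimate of ||mu_P - mu_Q||_H^2 from two samples, using only
# kernel evaluations (biased V-statistic form).
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d / (2 * sigma**2))

rng = np.random.default_rng(5)
X = rng.normal(loc=0.0, size=(200, 2))   # sample from P
Y = rng.normal(loc=1.0, size=(200, 2))   # sample from Q (shifted mean)

mmd_sq = (gauss_gram(X, X).mean()
          + gauss_gram(Y, Y).mean()
          - 2 * gauss_gram(X, Y).mean())
print(mmd_sq)   # noticeably > 0 here; shrinks towards 0 when P and Q coincide
```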
**Expectations as Inner Products:** To calculate the expected value of a function \(f\), we simply take an inner product:
\[\mathbb{E}_{\mathbb{P}}[f(X)] = \langle f, \mu_{\mathbb{P}} \rangle_{\mathcal{H}}\]
Lemma (Existence of the mean embedding). If \(\mathbb{E}_{X\sim\mathbb{P}}[\sqrt{k(X, X)}] < \infty\), then \(\mu_\mathbb{P}\in\mathcal{H}\) and \(\mathbb{E}_{X\sim\mathbb{P}}[f(X)]=\langle f,\mu_{\mathbb{P}}\rangle_\mathcal{H}\) for all \(f \in \mathcal{H}\).
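With the empirical embedding \(\hat\mu_{\mathbb{P}} = \frac{1}{n}\sum_j k(x_j,\cdot)\), the inner-product form of the expectation becomes a plain average of kernel evaluations. A minimal sketch, where \(f\) is a finite kernel expansion with illustrative centres and coefficients:

```python
# E_P[f] = <f, mu_P>_H, checked with the empirical mean embedding.
import numpy as np

def k(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 2))              # sample defining the empirical embedding
zs = rng.normal(size=(4, 2))               # centres of f = sum_i alpha_i k(z_i, .)
alpha = rng.normal(size=4)

G = np.array([[k(z, x) for x in X] for z in zs])   # G[i, j] = k(z_i, x_j)

# <f, mu_hat_P>_H: pair f's coefficients with the averaged kernel sections
inner_product = alpha @ G.mean(axis=1)

# direct Monte-Carlo estimate of E_P[f(X)]
monte_carlo = np.mean(alpha @ G)
print(np.isclose(inner_product, monte_carlo))      # True, by the reproducing property
```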
4. Operators: Covariance and Conditional Mean
Just as we have covariance matrices for vectors, we have operators for functions.
4.1 Cross-Covariance Operator
Definition. The uncentered cross-covariance operator \(C_{YX}: \mathcal{H} \to \mathcal{G}\) is:
\[C_{YX} := \mathbb{E}_{YX} [\psi(Y) \otimes \phi(X)]\]
It satisfies the property:
\(\langle g, C_{YX} f \rangle_{\mathcal{G}} = \mathbb{E}_{YX}[g(Y) f(X)]\) for all \(f \in \mathcal{H}\) and \(g \in \mathcal{G}\).
The centered cross-covariance operator is
\[\tilde C_{YX} :=\mathbb{E}_{YX}[\psi(Y)\otimes\phi(X)] - \mu_{P_Y} \otimes \mu_{P_X}=\mu_{P_{YX}} - \mu_{P_Y \otimes P_X}.\]
Equivalently, we can define \(\tilde C_{YX}\) as the unique bounded operator that satisfies
\[\langle g, \tilde C_{YX}f\rangle_{\mathcal{G}} = \text{Cov}[g(Y),f(X)]\]
for all \(f \in \mathcal{H}\) and \(g \in \mathcal{G}\).
Covariance operator. If \(X = Y\) , we call \(C_{XX}\) the covariance operator.
- \((C_{YX}f)(\cdot)=\int_{\mathcal{X}\times \mathcal{Y}}l(\cdot,y)\,f(x)\, dP_{XY}(x,y)\).
- \((C_{XX}f)(\cdot)=\int_{\mathcal{X}}k(\cdot,x)\,f(x)\, dP_{X}(x)\).
Theorem. If \(\mathbb{E}_{Y\mid X}[g(Y)\mid X=\cdot]\in \mathcal{H}\) for all \(g\in\mathcal{G}\), then
\[C_{XX}\,\mathbb{E}_{Y\mid X}[g(Y)\mid X =\cdot]=C_{XY}\, g\]
Proof of the theorem. Using the rank-one action \((\phi(x)\otimes\phi(x))\,h = h(x)\,\phi(x)\) and the tower property,
\[\begin{align*} C_{XX}\,\mathbb{E}_{Y\mid X}[g(Y)\mid X =\cdot] &= \mathbb{E}_{X}[\phi(X)\otimes\phi(X)]\, \mathbb{E}_{Y\mid X}[g(Y)\mid X =\cdot]\\ &= \int_{\mathcal{X}} \phi(x)\, \mathbb{E}_{Y\mid X}[g(Y)\mid X =x]\, dP_X(x) \\ &= \int_{\mathcal{X}}\int_{\mathcal{Y}} \phi(x)\, \langle g , \psi(y)\rangle_{\mathcal{G}}\, dP_{Y\mid X}(y\mid x)\, dP_{X}(x) \\ &= \int_{\mathcal{X}\times\mathcal{Y}} \bigl(\phi(x)\otimes\psi(y)\bigr)\, g \; dP_{XY}(x,y) \\ &= C_{XY}\, g \end{align*}\]
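Because the theorem is an operator identity, it can be sanity-checked exactly in a finite setting. The sketch below uses discrete \(X\) and \(Y\) with one-hot feature maps (so the RKHS is finite-dimensional and every function of \(X\) lies in \(\mathcal{H}\)); the joint distribution and \(g\) are arbitrary illustrative choices.

```python
# Finite check of C_XX E[g(Y)|X=.] = C_XY g with one-hot feature maps:
# C_XX = diag(p_X), C_XY = joint probability matrix, and the conditional
# expectation is an ordinary vector indexed by the states of X.
import numpy as np

rng = np.random.default_rng(7)
P = rng.random(size=(3, 4))          # joint p(x, y) over 3 x-states, 4 y-states
P /= P.sum()

p_x = P.sum(axis=1)                  # marginal of X
C_XX = np.diag(p_x)                  # E[phi(X) phi(X)^T] with one-hot phi
C_XY = P                             # E[phi(X) psi(Y)^T]

g = rng.normal(size=4)               # an arbitrary function of Y
h = (P @ g) / p_x                    # h(x) = E[g(Y) | X = x]

print(np.allclose(C_XX @ h, C_XY @ g))   # True
```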
The empirical estimate of the centered \(\tilde C_{YX}\) is:
\[\begin{align*} \hat C_{YX}&= \frac{1}{n} \sum_{i=1}^{n} l(y_i,\cdot)\otimes k(x_i,\cdot) - \hat \mu_{P_Y} \otimes \hat \mu_{P_X} \\ &=\frac{1}{n} \sum_{i=1}^{n} \{l(y_i,\cdot)-\hat \mu_{P_Y} \} \otimes \{k(x_i,\cdot)-\hat\mu_{P_X} \} \\ &= \frac1n (\Psi - \hat\mu_{P_Y} \mathbf{1}^\top)(\Phi - \hat\mu_{P_X} \mathbf{1}^\top)^\top \\ &= \frac1n \left(\Psi (I-\tfrac1n \mathbf{1} \mathbf{1}^\top)\right)\left(\Phi (I-\tfrac1n \mathbf{1} \mathbf{1}^\top)\right)^\top \\ &= \frac1n \Psi (I-\tfrac1n \mathbf{1} \mathbf{1}^\top) \Phi^\top \\ &=\frac1n \Psi H \Phi^\top \end{align*}\]
where \(H = I_n - \frac1n \mathbf{1}\mathbf{1}^\top\) is the centering matrix, \(\mathbf{1}\) is the \(n\)-vector of ones, \(\Psi = (\psi(y_1),\dots,\psi(y_n))\), and \(\Phi = (\phi(x_1),\dots,\phi(x_n))\); the last step uses \(H = H^\top\) and \(H^2 = H\).
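The last identity, \(\hat C_{YX} = \frac1n \Psi H \Phi^\top\), is pure matrix algebra, so it can be checked with any explicit finite-dimensional feature map. A minimal sketch, assuming simple polynomial features as stand-ins for \(\psi\) and \(\phi\):

```python
# Verify the centering-matrix form of the empirical centered cross-covariance.
import numpy as np

rng = np.random.default_rng(8)
n = 100
x, y = rng.normal(size=n), rng.normal(size=n)

Phi = np.stack([np.ones(n), x, x**2])     # columns phi(x_i), shape (3, n)
Psi = np.stack([np.ones(n), y, y**2])     # columns psi(y_i), shape (3, n)

H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
C_hat = Psi @ H @ Phi.T / n               # (1/n) Psi H Phi^T

mu_y = Psi.mean(axis=1, keepdims=True)    # empirical mean embeddings
mu_x = Phi.mean(axis=1, keepdims=True)
C_direct = (Psi - mu_y) @ (Phi - mu_x).T / n
print(np.allclose(C_hat, C_direct))       # True
```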
4.2 Conditional Mean Embedding (CME)
The CME represents the conditional distribution \(P(Y\mid X=x)\) as an element in \(\mathcal{G}\).
Let \(\mathcal{X}\) and \(\mathcal{Y}\) be measurable spaces and let \(X\) and \(Y\) be random variables taking values in \(\mathcal{X}\) and \(\mathcal{Y}\), respectively. Assume \(k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) and \(l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}\) are positive definite kernels with corresponding RKHSs \(\mathcal{H}\) and \(\mathcal{G}\). Let \(U_{Y\mid X} : \mathcal{H} \to \mathcal{G}\) and \(\mu_{Y\mid x} \in \mathcal{G}\) denote the conditional mean embeddings of the conditional distributions \(P(Y \mid X)\) and \(P(Y \mid X = x)\), respectively, so that
\[\mu_{Y\mid x} = \mathbb{E}_{Y\mid X}[\psi(Y)\mid X=x] = U_{Y\mid X}\, k(x,\cdot)\] \[\mathbb{E}_{Y\mid X}[g(Y)\mid X=x] = \langle g, \mu_{Y\mid x} \rangle_\mathcal{G}, \qquad \forall g \in \mathcal{G}\]
Definition. Let \(C_{XX} : \mathcal{H} \to \mathcal{H}\) and \(C_{YX} : \mathcal{H} \to \mathcal{G}\) be the covariance operator of \(X\) and the cross-covariance operator from \(X\) to \(Y\), respectively. Then the conditional mean embeddings \(U_{Y\mid X}\) and \(\mu_{Y\mid x}\) are defined as
\[U_{Y\mid X}:= C_{YX} C^{-1}_{XX}\] \[\mu_{Y\mid x}:=C_{YX}C^{-1}_{XX}k(x,\cdot)\]
Property:
- \(\mathbb{E}_{Y\mid X}[g(Y)\mid X=x]=\langle g, \mu_{Y\mid x}\rangle_{\mathcal{G}}=\langle \mathbb{E}_{Y\mid X}[g(Y)\mid X=\cdot], k(x,\cdot)\rangle_{\mathcal{H}}\).
Empirical estimate:
\[\begin{align*} \hat U_{Y\mid X} &= \hat C_{YX} (\hat C_{XX} + \lambda I)^{-1}\\ &= \Psi (K+\lambda I)^{-1} \Phi^\top \end{align*}\]
where \(\Psi:= (\psi(y_1), \dots, \psi(y_n))\), \(\Phi:= (\phi(x_1), \dots, \phi(x_n))\), \(K=\Phi^\top \Phi\) is the Gram matrix of the samples of \(X\), and \(\lambda\) is the regularization parameter used to avoid overfitting (the factor of \(n\) arising from the empirical operators is absorbed into \(\lambda\)). Consequently,
\[\begin{align*} \hat \mu_{Y\mid x} &= \Psi (K+\lambda I)^{-1} \Phi^\top k(x,\cdot)\\ &= \Psi (K+\lambda I)^{-1} K_{:,x} \end{align*}\]
where \(K_{:,x} = (k(x,x_1), \dots, k(x,x_n))^\top\).
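Operationally, \(\hat\mu_{Y\mid x}\) is a weighted sum of the training features \(\psi(y_i)\) with weights \(w = (K+\lambda I)^{-1} K_{:,x}\), so conditional expectations are estimated as \(\mathbb{E}[g(Y)\mid X=x] \approx \sum_i w_i\, g(y_i)\). A minimal sketch with illustrative data, bandwidth, and \(\lambda\); here \(g(y)=y\) is chosen only for readability, even though it need not lie exactly in \(\mathcal{G}\):

```python
# Empirical conditional mean embedding: conditional expectations become
# weighted sums over the training sample with weights (K + lam I)^{-1} K_{:,x}.
import numpy as np

def gauss_gram(A, B, sigma=0.5):
    d = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d / (2 * sigma**2))

rng = np.random.default_rng(9)
n, lam = 300, 1e-2
x_train = rng.uniform(-3, 3, size=n)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=n)

K = gauss_gram(x_train, x_train)
x0 = np.array([1.0])
w = np.linalg.solve(K + lam * np.eye(n), gauss_gram(x_train, x0)).ravel()

# estimate E[Y | X = 1.0]; roughly sin(1.0) ~ 0.84 for this synthetic data
print(w @ y_train)
```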
Regression Interpretation: The operator \(U_{Y\mid X} = C_{YX} C_{XX}^{-1}\) is actually the solution to a Vector-Valued Ridge Regression:
\[\hat{U}_{Y\mid X} = \arg\min_{U} \sum_{i=1}^n \|\psi(y_i) - U\phi(x_i)\|^2_{\mathcal{G}} + \lambda \|U\|^2_{HS}\]
References
- Sriperumbudur et al. (2011), On the Universality and Characteristic Kernels
- Gretton et al., Notes on Mean Embeddings and Covariance Operators
- Muandet et al. (2017), Kernel Mean Embedding of Distributions: A Review and Beyond
- Song et al. (2013), Kernel Embeddings of Conditional Distributions
- Song et al. (2009), Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems