Natural Gradient the Differential Geometric Way
Natural gradient adaptation (Amari, 1998) is a method for performing gradient descent on a Riemannian manifold. While much of what has been written about natural gradients comes from the statistics and machine learning literature, we can understand the fundamentals of the method independently of these specific application domains.
This post is basically a severely abbreviated (yet notably un-short) introduction to differential geometry, with the objective of defining natural gradient adaptation in the language of that framework. I will mostly give definitions, only providing proofs when they are particularly helpful in understanding the definitions. If the topic interests you, I strongly recommend the series of lectures by Dr. Frederic Schuller, which go into way more depth on the topic of differential geometry than I do here.
Motivation
Many discussions of gradient descent for optimization are predicated on several tacit assumptions that are easily overlooked when we only care about optimization problems on \(\R^d\). When we move to more general spaces -- or even just non-linear parameterizations of \(\R^d\) -- we run into problems if we don't closely examine these assumptions.
The first assumption is that points in a space are themselves vectors, which we implicitly assume when we coordinatize the space with tuples of real numbers and then add vectors to points component-wise. In \(\R^d\) this is fine -- it is a vector space and so its points are vectors. In general, though, we must separate the notions of points in a manifold, local coordinates around a point, and vectors tangent to the manifold at a point.
The second assumption is that we have only one vector space in play (indeed, in the case of \(\R^d\) we may not even explicitly distinguish it from the space of points we're optimizing within). Put another way, we are usually imprecise about where exactly vectors live. Do they live in the same space as the points where we evaluate our optimization function? Are we to picture them all as rooted at the origin of the space? In fact, we'll find we need to define distinct vector spaces attached to the manifold at every point (so-called tangent spaces). We'll only be able to add vectors belonging to the same vector space (i.e., attached to the same point), and we'll only be able to displace ourselves from a point using vectors drawn from the tangent space at that point.
The third assumption has two parts, pertaining to the tuple of partial derivatives of a function \(f\) at a point: (1) that the tuple constitutes a vector, which we can add component-wise to the coordinates of a point to get a new point, and (2) that the tuple corresponds to the direction in which \(f\) changes most rapidly. In general we need to do some more work to get from a differentiable function \(f\) to a vector whose direction corresponds to the most rapid change of \(f\), and which we can use to displace ourselves from the point.
To address these subtleties, we will start from a very basic idea -- that of a topological manifold -- and build our way up. Along the way, we will:
- Develop the definition of a smooth manifold
- Define the gradient of a function on a smooth manifold
- Introduce the concept of metric tensor, making our smooth manifold "Riemannian"
- Define formally the natural gradient of a function on a manifold
- Show that the natural gradient corresponds to the direction of steepest ascent.
So, we start at the start.
Topological manifolds
A topological manifold is, in essence, a space that looks flat ("Euclidean") from up close, but may have a more complicated geometry when you "zoom out". For example, Earth appears flat from the surface, but viewed from space its spherical geometry becomes apparent. To make this notion mathematically precise, we start with a topological space, which is a set \(X\) with a collection \(\tau\) of subsets of \(X\) satisfying three axioms:
- \(\varnothing, X \in \tau\)
- \(\mcU \subset \tau \implies \left(\bigcup\limits_{U \in \, \mcU} U\right) \in \tau\)
- \(\{U_i\}_{i = 1..n} \subset \tau \implies \left(\bigcap\limits_{i=1}^{n} U_i\right) \in \tau\)
Axiom 2 applies to any collection of sets \(\mcU \subset \tau\), while axiom 3 applies only to finite subsets of \(\tau.\) The elements of \(\tau\) are called open sets and can be thought of as representing a squishy notion of closeness in \(X\): very roughly speaking, the "more" open sets that contain a pair of points (elements of \(X\)), the closer the points are to each other. This generalizes the notion of open \(\epsilon\)-balls in a metric space, and any metric space has a canonical metric topology generated by its open \(\epsilon\)-balls.
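To make the axioms concrete, here's a tiny Python sketch (my own toy, nothing standard) that checks whether a collection of subsets of a finite set is a topology. For a finite collection, checking closure over all pairs suffices by induction.

```python
from itertools import combinations

def is_topology(X, tau):
    """Check whether tau (a collection of subsets) is a topology on X."""
    X = frozenset(X)
    tau = {frozenset(U) for U in tau}
    # Axiom 1: the empty set and X itself are open.
    if frozenset() not in tau or X not in tau:
        return False
    # Axioms 2 and 3: closure under unions and (finite) intersections.
    # For a finite collection, checking all pairs suffices by induction.
    return all(U | V in tau and U & V in tau
               for U, V in combinations(tau, 2))

# A nested chain of subsets is always a topology:
print(is_topology({1, 2, 3}, [set(), {1}, {1, 2}, {1, 2, 3}]))  # True
```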
Given two topological spaces \((X, \tau_{\scriptsize X})\) and \((Y, \tau_{\scriptsize Y})\) and a function \(f: X \to Y\), we say \(f\) is continuous if for any \(U\) open in \(Y\), the pre-image under \(f\) is open in \(X\): \[ f^{-1}(U) = \{x \in X | f(x) \in U\} \in \tau_{\scriptsize X} \] This is a generalization of the usual \(\epsilon, \delta\) definition of continuity. It essentially says that \(f\) should take nearby stuff in \(X\) to nearby stuff in \(Y.\) Two topological spaces are said to be homeomorphic when there exists a continuous bijection \(f: X \to Y\) whose inverse is also continuous.
A topological manifold \((\M, \tau)\) of dimension \(d\) is a topological space in which each point \(p \in \M\) has an open neighborhood \(U\) which is homeomorphic to an open subset of \(\R^d.\) This is how we state formally that the space "looks flat up close". Note that \(\R^d\) is assumed to have the metric topology induced by the Euclidean metric.
Charts and Atlases
Once we know that every point has a neighborhood that looks Euclidean, we can use this knowledge to inherit a notion of differentiability from \(\R^d.\) Fix any point \(p \in \M.\) The definition of topological manifold guarantees us an open set \(U \subset \M\) containing \(p\) and a homeomorphism \(\varphi : U \to V \subset \R^d.\) Such a pair \((U, \varphi)\) of open set and mapping is called a chart. An atlas is a collection \(\mathcal{A}\) of charts such that the constituent open sets cover all of \(\M\), i.e., \(\bigcup\limits_{\alpha \in A} U_\alpha = \M,\) where \(A\) is some index set over the elements of \(\mathcal{A}.\)
Consider two charts, \((U_\alpha, \varphi_\alpha)\) and \((U_\beta, \varphi_\beta),\) whose domains overlap, i.e. \(U = U_\alpha \cap U_\beta \neq \varnothing\). Let \(V_\alpha\) and \(V_\beta\) be the images of \(U\) under \(\varphi_\alpha\) and \(\varphi_\beta\), respectively: \(V_\alpha = \varphi_\alpha(U) \subset \R^d\) and \(V_\beta = \varphi_\beta(U) \subset \R^d.\) We call \(\varphi_\beta \circ \varphi_\alpha^{-1} : V_\alpha \to V_\beta\) a transition map (note we should actually talk about restrictions of \(\varphi_\alpha\) and \(\varphi_\beta\) to the intersection \(U\); we're being a little loose in the notation for brevity). As a map between subsets of Euclidean space \(\R^d\), this is the first object we've discussed so far for which we possess a notion of differentiability a priori!
It's useful to define a notation for so-called differentiability classes of functions. A continuous map is said to be of class \(C^0\). A map which is differentiable \(k\) times is said to be of class \(C^k.\) A map which is class \(C^k\) for all \(k \in \mathbb{N}\) is said to be of class \(C^\infty.\) We usually omit "of class" and just say a function is \(C^k.\) Sometimes it's useful to be explicit about the domain and codomain of a differentiability class. We denote these parenthetically, e.g., \(C^k(X, \R^d)\) is the set of \(k\)-times differentiable maps from \(X\) to \(\R^d.\) When the codomain is clear from context, we may omit it and write \(C^k(X).\)
A smooth manifold is then defined as a topological manifold \((\M, \tau)\) with an atlas for which all the transition maps are \(C^\infty(\R^d, \R^d).\) This definition may seem a bit odd. The motivation is that we'll rely heavily on charts in order to induce a notion of differentiability on the manifold. If we had transition maps that were not smooth, then our induced differentiability structure would be inconsistent where the charts overlap. Note that we don't strictly need \(C^\infty\) transitions in general. Usually requiring \(C^\infty\) maps is just a lazy way of saying we require everything to be differentiable as often as any particular application requires.
Charts allow us (locally) to construct a system of coordinates on a manifold. When we are working with a chart \((U, \varphi)\) on a manifold \(\M\), we will often write \(\varphi^i\) to denote the \(i\)th coordinate of a point in the image of \(\varphi.\) In this sense, each component of the chart is a map \(\varphi^i: U \to \R.\)
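As a concrete illustration (my own sketch, with made-up names like `phi_alpha`): the unit circle \(S^1\) can't be covered by one chart, but two overlapping angle charts work, and the transition map between them is smooth on the overlap.

```python
import numpy as np

# Chart alpha: angle in (-pi, pi), defined on S^1 minus the point (-1, 0).
def phi_alpha(p):
    x, y = p
    return np.arctan2(y, x)

def phi_alpha_inv(theta):
    return (np.cos(theta), np.sin(theta))

# Chart beta: angle in (0, 2*pi), defined on S^1 minus the point (1, 0).
def phi_beta(p):
    theta = np.arctan2(p[1], p[0])
    return theta if theta > 0 else theta + 2 * np.pi

# The transition map phi_beta o phi_alpha^{-1} on the overlap: it equals
# theta on (0, pi) and theta + 2*pi on (-pi, 0) -- affine, hence C-infinity,
# on each connected component of the overlap.
for theta in (0.5, 2.0, -0.5, -2.0):
    print(theta, "->", phi_beta(phi_alpha_inv(theta)))
```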
Functions on a Smooth Manifold
In the same way that we inherit a notion of differentiability of transition maps from the familiar notion of differentiability in Euclidean space, we can combine charts with functions on a manifold to get a notion of differentiability "through" the manifold. Let's make this precise by starting with the simplest case of real-valued functions on a \(d\)-dimensional manifold \(\M.\) Let \(f: \M \to \R\) and let \((U, \varphi)\) be a chart on \(\M.\) We can form the function \[ f \circ \varphi^{-1} : \varphi(U) \to \R \] and differentiate this as usual as a multivariate real-valued function on \(\varphi(U) \subset \R^d.\) We use the shorthand \(\pd{f}{\varphi^i}\) to denote the partial derivative of \(f \circ \varphi^{-1}\) with respect to the \(i\)th coordinate of \(\R^d.\) Now, this expression looks a lot like what we'd usually call a gradient, but don't be fooled! This is nothing but a tuple of partial derivatives of a function with respect to its positional parameters. It's true that in Euclidean space with Cartesian coordinates, the gradient of a function has components equal to the partial derivatives of that function with respect to its positional parameters, but this does not hold generally. We will define the gradient shortly, but first we need the notions of tangent spaces, vectors and covectors.
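Here's a small numeric sketch of this recipe (my own example, not from any library): take the upper hemisphere of \(S^2\) with the graph chart \(\varphi(x, y, z) = (x, y)\), and let \(f\) be the height function.

```python
import numpy as np

# Inverse of the graph chart phi(x, y, z) = (x, y) on the upper hemisphere.
def phi_inv(u, v):
    return np.array([u, v, np.sqrt(1.0 - u**2 - v**2)])

def f(p):                  # f: M -> R, the height function
    return p[2]

def f_in_chart(u, v):      # f o phi^{-1}, an ordinary function on R^2
    return f(phi_inv(u, v))

# Partials of f o phi^{-1} at (u, v) = (0.3, 0.1) by central differences.
u, v, h = 0.3, 0.1, 1e-6
df_du = (f_in_chart(u + h, v) - f_in_chart(u - h, v)) / (2 * h)
df_dv = (f_in_chart(u, v + h) - f_in_chart(u, v - h)) / (2 * h)
print(df_du, df_dv)   # analytically: -u/z and -v/z, with z = sqrt(1 - u^2 - v^2)
```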
Curves on a Smooth Manifold
In pursuit of the notion of vector fields on a manifold, we'll first construct smooth curves, and from them a notion of tangent vectors at a point. We'll generally be interested in curves just in the vicinity of a point \(p \in \M.\) Thus we'll generally assume our curves are defined on a small interval of \(\R\) centered around \(0\) and taking \(0\) to \(p\): \[ \gamma : \Ieps \to \M, \quad \gamma(0) = p \] We require that \(\gamma\) be continuous and furthermore that the composition \((\varphi \circ \gamma): \Ieps \to \R^d\) be \(C^\infty.\) Such a map \(\gamma\) is called a smooth curve on the manifold \(\M.\) We can differentiate the composition \(\varphi \circ \gamma\) to get a set of real values \[ \td{}{t}\left(\varphi^i \circ \gamma\right)(t) \in \R. \] We use the shorthand \(\td{}{t} \gamma(t)\), or \(\dot{\gamma}(t)\) to denote these tangents to the curve. When we need to refer to individual components we write \(\dot{\gamma}^i(t).\) To be precise, if \(\{\e^i\}_{i=1..d}\) are a basis on \(\R^d\), we have \[ \dot{\gamma}(t) = \sum_i \dot{\gamma}^i(t) \e^i. \label{gamma_dot}\tag{1} \]
These quantities are something like a tangent vector to the curve. But a vector needs to live in a vector space so to call this structure a vector we'd better define the space it lives in. We should also note that the above representation depends on the choice of chart.
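For instance (continuing my hemisphere sketch from above), pushing a curve through the chart reduces \(\dot{\gamma}\) to the ordinary derivative of a map \(\R \to \R^2\):

```python
import numpy as np

# A smooth curve on the upper hemisphere of S^2 with gamma(0) = p.
def gamma(t):
    x, y = 0.3 + t, 0.1 - 2.0 * t
    return np.array([x, y, np.sqrt(1.0 - x**2 - y**2)])

def phi(p):                # the graph chart: drop the z coordinate
    return p[:2]

# Components of the tangent at t = 0, computed in the chart.
h = 1e-6
gamma_dot = (phi(gamma(h)) - phi(gamma(-h))) / (2 * h)
print(gamma_dot)           # ~ [1, -2]
```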
Tangent Spaces on a Smooth Manifold
We'd like to think of these \(\dot{\gamma}\,\)'s as tangent vectors, but we need to deal with two issues:
- non-uniqueness: lots of curves will have the same derivatives at a point
- dependence on choice of chart.
As with many non-uniqueness problems in mathematics, the trick will be to work in terms of equivalence classes. Recall that equivalence classes are sets of objects that are related to each other by an equivalence relation. We define the equivalence relation \(\sim\) on the set of smooth curves by \(\gamma_1 \sim \gamma_2\) iff for any chart \((U, \varphi)\) with \(p \in U\), \[ \td{}{t}\left(\varphi \circ \gamma_1\right)\bigg|_{t = 0} = \td{}{t}\left(\varphi \circ \gamma_2\right)\bigg|_{t = 0}. \]
It's straightforward to show that this relation is reflexive, symmetric, and transitive, and hence an equivalence relation. More subtle is showing that the equivalence relation is independent of the choice of chart. Suppose we have two charts \((U, \varphi_1)\) and \((U, \varphi_2)\) covering \(p\) (we assume they cover the same neighborhood for simplicity; otherwise we would need to take intersections and make sure we consider curves restricted to that intersection. This could be done and everything would be okay). To see that the two charts give rise to the same equivalence classes, we suppose that \[ \td{}{t}\left(\varphi_1 \circ \gamma_1\right)\bigg|_{t=0} = \td{}{t}\left(\varphi_1 \circ \gamma_2\right)\bigg|_{t=0} \] and seek to show that the same holds for \(\varphi_2.\) To see this, we can use the fact that \(\varphi^{-1}_1 \circ \varphi_1 = \mathrm{id}_\M\), along with the ordinary chain rule of (partial) differentiation as follows: \[ \begin{align} \td{}{t}\left(\varphi_2 \circ \gamma_1\right)\bigg|_{t=0} & = \td{}{t}\left(\varphi_2 \circ \varphi_1^{-1} \circ \varphi_1 \circ \gamma_1\right)\bigg|_{t=0} \\ & = \sum_i \pd{}{\varphi_1^i}\left(\varphi_2 \circ \varphi_1^{-1}\right) \bigg|_{\varphi_1(\gamma_1(0))} \cdot \td{}{t} \left( \varphi_1^i \circ \gamma_1\right) \bigg|_{t = 0} \\ & = \sum_i \pd{}{\varphi_1^i}\left(\varphi_2 \circ \varphi_1^{-1}\right) \bigg|_{\varphi_1(\gamma_2(0))} \cdot \td{}{t} \left( \varphi_1^i \circ \gamma_2\right) \bigg|_{t = 0} \\ & = \td{}{t}\left(\varphi_2 \circ \varphi_1^{-1} \circ \varphi_1 \circ \gamma_2\right)\bigg|_{t=0} \\ & = \td{}{t}\left(\varphi_2 \circ \gamma_2\right)\bigg|_{t=0} \end{align} \] where we write \(\pd{}{\varphi_1^i}\) for the partial derivative with respect to the \(i\)th coordinate value in the codomain of \(\varphi_1\). The substitutions of \(\gamma_2\) for \(\gamma_1\) between lines two and three are justified because
- \(\gamma_1(0) = \gamma_2(0) = p\), and
- by assumption, \(\gamma_1 \sim \gamma_2\) under the chart \(\varphi_1.\)
We call the set of such equivalence classes of curves \(\tpm\), the tangent space to \(\M\) at \(p.\) Showing that \(\tpm\) is an \(\R\)-vector space is yet more work! We need to define addition and scalar multiplication of equivalence classes of curves. This is done, again, by way of a chart. Given two equivalence classes \(X, Y \in \tpm\), we choose a representative member (or "germ"), say, \(\gx, \gy\), and compose these with a chart map at \(p.\) This gives us a pair of curves in \(\R^d.\) In \(\R^d\) we can easily combine points according to the usual vector addition. We then map back by the inverse chart. Formally: \[ +_{\tpm}: \tpm \times \tpm \to \tpm \\ X +_{\tpm} Y = \left[\varphi^{-1} \circ \big( (\varphi \circ \gx) +_{\R^d} (\varphi \circ \gy)\big) \right]_\sim \] where we emphasize that the addition on the right side is in the vector space \(\R^d.\) Note that we must again prove that this definition is independent of the choice of chart (it is). We must also show that the definition is independent of the choice of representatives from the equivalence classes. We will not go through all of that here; doing so is great practice in applying the definitions!
Tangent Vectors as Derivations on Functions
We can use our newly minted tangent vectors at a point \(p\) to define directional derivative operators at \(p.\) Let \(f: \M \to \R\) be a smooth function, \(X \in \tpm\) a tangent vector, and \(\gamma \in X\) a representative of the equivalence class that \(X\) represents. Then \(f \circ \gamma: \R \to \R\) and we can think of \(X\) as acting on \(f\) in the following sense, \[ X(f) = \td{}{t} f\big(\gamma(t)\big)\bigg|_{t = 0}. \] \(X\) can be seen as a real-valued map \(X: C^\infty(\M, \R) \to \R\) on the space of smooth real-valued functions on \(\M.\) Given two functions \(f\) and \(g\) and real numbers \(a, b\) we have the following, more-or-less immediately: \[ X(af + bg) = aX(f) + bX(g) \] and \[ \begin{align} X(fg) & = \td{}{t} \bigg(fg\big(\gamma(t)\big) \bigg)\bigg|_{t = 0} \\ & = \left( \td{}{t} \bigg(f\big(\gamma(t)\big)\bigg) g\big(\gamma(t)\big) + f\big(\gamma(t)\big)\td{}{t} \bigg( g\big(\gamma(t)\big) \bigg) \right)\bigg|_{t=0} \\ & = X(f)\, g(p) + f(p)\, X(g) \end{align} \] The first relation states that the map \(X\) is \(\R\)-linear. The second is called the Leibniz rule (or product rule). An object satisfying both is called a derivation; some texts define tangent vectors axiomatically as derivations on the smooth functions at a point.
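Here's a minimal numeric check of the Leibniz rule (my own sketch, taking \(\M = \R^2\) with the identity chart so everything stays elementary):

```python
import numpy as np

# Take M = R^2 with the identity chart so everything stays elementary.
def gamma(t):                  # a curve with gamma(0) = p = (1, 2)
    return np.array([1.0 + t, 2.0 - 3.0 * t])

f = lambda p: p[0]**2 + p[1]
g = lambda p: np.sin(p[0] * p[1])

def X(func, h=1e-6):           # X(func) = d/dt func(gamma(t)) at t = 0
    return (func(gamma(h)) - func(gamma(-h))) / (2 * h)

p = gamma(0.0)
print(X(lambda q: f(q) * g(q)))        # X(fg)
print(X(f) * g(p) + f(p) * X(g))       # X(f) g(p) + f(p) X(g): same number
```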
We can show that derivations combine linearly: \((aX + bY)(f) = aX(f) + bY(f).\) To do so, we need to recall our rule for adding tangent vectors, which are actually equivalence classes of curves. To simplify things a bit, let's consider addition of vectors first, and defer scalar multiplications, which will then be straightforward. For vectors \(X\) and \(Y\), let \(\gx\) and \(\gy\) be germs. Then \[ \begin{align} X + Y & = [\gx]_\sim +_{\tpm} [\gy]_\sim \\ & = \left[ \varphi^{-1} \left( (\varphi \circ \gx) +_{\R^d} (\varphi \circ \gy) \right) \right]_\sim \\ & = \left[ \gxpy \right]_\sim \end{align} \] To show how this acts on a function \(f\), we again use the \(\varphi^{-1} \circ \varphi = \mathrm{id}_\M\) trick to write \[ \begin{align} (X + Y)(f) & = \td{}{t} \bigg( f \circ \gxpy \bigg)\bigg|_{t = 0} \\ & \: = \td{}{t} \bigg( f \circ \varphi^{-1} \circ \varphi \circ \gxpy \bigg)\bigg|_{t = 0} \end{align} \] Now since \(f \circ \varphi^{-1} : \R^d \to \R\) and \(\varphi \circ \gxpy : \R \to \R^d\) we can use the chain rule to write \[ \begin{align} (X + Y)(f) & = \: \sum_i \pd{}{\varphi^i} \bigg(f \circ \varphi^{-1} \bigg) \bigg|_{\varphi(\gxpy(0))} \cdot \td{}{t} \bigg( \varphi^i \circ \gxpy \bigg)\bigg|_{t = 0} \\ & = \sum_i \pd{f}{\varphi^i} \bigg|_{\varphi(\gxpy(0))} \cdot \td{}{t} \bigg( \varphi^i \circ \gxpy \bigg)\bigg|_{t = 0} \\ & = \sum_i \pd{f}{\varphi^i} \bigg|_{\varphi(\gxpy(0))} \cdot \gidxpy \end{align} \] Since \(\gdxpy\) can be written as \[ \begin{align} \gdxpy & = \td{}{t} \left( \varphi \circ \gxpy \right) \\ & = \td{}{t} \left( \varphi \circ \varphi^{-1} \left( (\varphi \circ \gx) +_{\R^d} (\varphi \circ \gy) \right) \right) \\ & = \td{}{t} \big( (\varphi \circ \gx) +_{\R^d} (\varphi \circ \gy) \big) \\ & = \td{}{t} (\varphi \circ \gx) +_{\R^d} \td{}{t} (\varphi \circ \gy) \\ & = \gdx +_{\R^d} \gdy \end{align}, \] we can plug into the expression above and write \[ \begin{align} (X + Y)(f) & = \sum_i \pd{f}{\varphi^i} \cdot \left( \gidx(0) +_{\R^d} \gidy(0) \right) \\ & = \sum_i \bigg( \pd{f}{\varphi^i} \cdot \gidx(0) +_{\R^d} \pd{f}{\varphi^i} \cdot \gidy(0) \bigg) \\ & = \td{}{t} \bigg( f \big( \gx(t) \big) \bigg) \bigg|_{t = 0} + \td{}{t} \bigg( f \big( \gy(t) \big) \bigg) \bigg|_{t = 0} \\ & = X(f) + Y(f) \end{align} \] Whew! Should we do scalar multiplication now? Let's leave that as an exercise. It's pretty similar. Anyway, with directional derivatives on a manifold in our toolkit, we are tantalizingly close to a generalized notion of gradient! We'll need just a few more structures in place to get there and prove interesting facts about it.
Chart-induced basis of \(\tpm\)
Given a vector space, it's natural to seek a basis. Equipped with a chart \((U, \varphi)\) covering a point \(p \in U\), we can induce a canonical basis on the tangent space \(\tpm.\) The idea is to form tangent vectors from curves which vary only one coordinate. Recall that we can think of the chart map \(\varphi: U \to \R^d\) as a collection of \(d\) "coordinate" maps \(\varphi^i: U \to \R\). Define a set of curves \(\chi^i: \Ieps \to \M\) by
\[ \chi^i(t) = \varphi^{-1}\big(\varphi(p) + t\, \e^i\big), \] where \(\e^i\) is the \(i\)th basis vector of \(\R^d\), so that \(\chi^i\) varies only the \(i\)th coordinate.
The \(\sim\)-equivalence classes corresponding to these curves form a basis for \(\tpm\). To prove this requires showing that any \(\sim\)-equivalence class of curves \([\gamma]_\sim\) can be written as a linear combination of the \([\chi^i]_\sim\):
\[ [\gamma]_\sim = \sum_{i=1}^d c_i [\chi^i]_\sim. \]
The proof follows straightforwardly from the definitions, taking
\[ c_i = \left(\td{}{t}\left(\varphi^i \circ \gamma\right)\right)\bigg|_{t=0}. \]
It's common to denote the basis vectors in \(\tpm\) by \(\fpd{}{\varphi^i} := [\chi^i]_\sim.\) This works out nicely notationally. It lets us write, e.g. \[ \begin{align} X(f) & = \left( \sum_i X^i \pd{}{\varphi^i} \right)(f) \\ & = \sum_i X^i \cdot \left( \pd{}{\varphi^i} (f) \right) \\ & = \sum_i X^i \cdot \td{}{t}\left( f \circ \chi^i(t) \right) \bigg|_{t = 0} \\ & = \sum_i X^i \cdot \left( \sum_j \pd{f}{\varphi^j} \bigg|_{\varphi(p)} \cdot \td{}{t} \left( \varphi^j \circ \chi^i(t) \right) \bigg|_{t = 0} \right) \\ & = \sum_i X^i \cdot \left( \sum_j \pd{f}{\varphi^j} \bigg|_{\varphi(p)} \cdot \delta^j_i \right) \\ & = \sum_i X^i \cdot \pd{f}{\varphi^i} \bigg|_{\varphi(p)} \end{align} \] Formally, this appears trivial, but recall that \(\fpd{f}{\varphi^i}\) is not as simple as it looks -- we defined it above as a shorthand for \(\fpd{}{\varphi^i}(f \circ \varphi^{-1}).\) As nice as the standard notation of modern differential geometry is, it's important to constantly remember that everything in sight has been redefined in terms of compositions with curves, charts and inverse charts!
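As a numeric sanity check of the final identity (again my own hemisphere sketch), we can compute \(X(f)\) once as a derivative along a representative curve and once as \(\sum_i X^i \, \fpd{f}{\varphi^i}\):

```python
import numpy as np

def phi_inv(u, v):             # inverse graph chart on the upper hemisphere
    return np.array([u, v, np.sqrt(1.0 - u**2 - v**2)])

F = lambda u, v: phi_inv(u, v)[0] * phi_inv(u, v)[2]   # f o phi^{-1} for f = x*z

u0, v0, h = 0.3, 0.1, 1e-6
X = np.array([1.0, -2.0])      # components X^i in the chart-induced basis

# Way 1: derivative of f along the curve gamma(t) = phi^{-1}((u0, v0) + t X).
way1 = (F(u0 + h * X[0], v0 + h * X[1]) - F(u0 - h * X[0], v0 - h * X[1])) / (2 * h)

# Way 2: sum_i X^i * partial_i (f o phi^{-1}).
dF_du = (F(u0 + h, v0) - F(u0 - h, v0)) / (2 * h)
dF_dv = (F(u0, v0 + h) - F(u0, v0 - h)) / (2 * h)
way2 = X[0] * dF_du + X[1] * dF_dv

print(way1, way2)              # agree to finite-difference accuracy
```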
Cotangent Spaces: Duals to Tangent Spaces
Let \(V\) be a vector space over a field \(K.\) The dual space to \(V\) is the set \(V^*\) of linear \(K\)-valued functions on \(V.\) That is to say for any \(X, Y \in V\), \(\omega \in V^*\), and \(a, b \in K\), we have \[ \omega : V \to K \\ \omega(a X + b Y) = a \omega(X) + b \omega(Y) \] The elements of \(V^*\) are called covectors. It's straightforward to show that such linear functions themselves form a vector space over the field \(K.\) \(V^*\) and \(V\) are dual in the sense that just as \(V^*\) is a set of linear maps from \(V\) to \(K\), we can view \(V\) as a set of linear maps from \(V^*\) to \(K\): \[ X : V^* \to K, \quad \omega \mapsto \omega(X) \] In the specific case of a smooth manifold, we call the dual to \(\tpm\), the cotangent space to \(\M\) at \(p\), and denote it \(\tpsm.\)
We can use the coordinate basis on \(\tpm\) to construct a basis for \(\tpsm.\) Consider a vector \(X = \sum_i X^i \fpd{}{\varphi^i}.\) We have a collection of natural mappings \(X \mapsto X^i\) given by the basis. These mappings are linear (in \(X\)) in that \(aX + bY \mapsto aX^i + bY^i.\) These mappings are by definition, then, covectors! We denote them, somewhat cryptically, as \(d\varphi^i.\) We can then write \[ X = \sum_i d\varphi^i(X) \pd{}{\varphi^i} \] and note that they satisfy \[ d\varphi^i\left(\pd{}{\varphi^j}\right) = \delta^i_j. \] These covectors \(d\varphi^i\) form a basis on \(\tpsm.\) Given any \(\omega \in \tpsm\), we can write \[ \omega = \sum_i \omega_i d\varphi^i \] by virtue of the action of \(\omega\) on an arbitrary vector \(X\): \[ \begin{align} \omega(X) & = \omega\left(\sum_i X^i \pd{}{\varphi^i}\right) \\ & = \omega \left(\sum_i d\varphi^i(X) \pd{}{\varphi^i} \right) \\ & = \sum_i \omega \left( \pd{}{\varphi^i} \right) d\varphi^i(X) \end{align} \] where the last equality follows from linearity of \(\omega.\) Then, independent of any particular \(X\) we may write \[ \omega = \sum_i \omega_i d\varphi^i \] where we define the components \(\omega_i = \omega(\fpd{}{\varphi^i}).\)
Tensors on a Manifold
Tensors generalize vectors and covectors. Vectors in a \(K\)-vector space linearly map covectors to the field \(K\), and covectors linearly map vectors to the field \(K.\) Tensors are multilinear maps on a Cartesian product of several vector and covector spaces. Given a vector space \(V\), a \((p, q)\)-tensor is a map \[ T: \underbrace{V^* \times \cdots \times V^*}_{p \: \mathrm{terms}} \times \underbrace{V \times \cdots \times V}_{q \: \mathrm{terms}} \to K \] Multilinear means it's linear in each of its arguments. So, \[ T(\ldots, aX + bY, \ldots) = aT(\ldots, X, \ldots) + bT(\ldots, Y, \ldots). \]
Metrics on a Manifold
As it stands, we have a way of creating vectors and vector spaces at each point in our manifold, but we lack a measure of magnitude of a vector. We might try to reuse our above trickery and, say, inherit a notion of length from \(\R^d\) by using the chart. This actually can work, locally, but the result is completely dependent on our choice of atlas. In order to impart an intrinsic (chart-independent) notion of vector length in our space, it's necessary to impose a metric structure. A smooth manifold with a metric is called a Riemannian manifold. A metric assigns to each point \(p \in \M\) a symmetric bilinear function \(g : \tpm \times \tpm \to \R.\) Depending on context one may require that a metric be
- positive definite (stronger): \(g(X, X) > 0\) for all \(X \neq 0\)
- non-degenerate (weaker): \(X \neq 0 \implies \exists Y \mid g(X, Y) \neq 0.\)
The former corresponds to the common notion of spatial distance, while the latter is most well-known in relativistic physics where time and space are on different footing, giving rise to metric signature \((+, -, -, -).\) In the weaker "non-degenerate" case, it's possible to have non-zero vectors of zero length (e.g., the "null" or "lightlike" vectors of relativity), but there'll always be at least one vector \(Y\) onto which such an \(X\) has non-zero projection. In the stronger "positive definite" case, zero length implies the vector is the zero vector. In linear algebra-speak, positive definite metrics are inner products (hey, that's the name of this blog!) defined on the tangent space at each point. Non-degenerate metrics are pseudo-inner products.
Using the chart-induced basis for \(\tpm\), we can derive a component representation for the metric function by applying it to pairs of basis vectors: \[ g_{ij} = g\left(\pd{}{\varphi^i}, \pd{}{\varphi^j}\right). \] The \(g_{ij}\) form a \(d \times d\) symmetric, non-degenerate (i.e., no zero eigenvalues) matrix. Given vectors \(X = \sum_i X^i \fpd{}{\varphi^i}\) and \(Y = \sum_i Y^i \fpd{}{\varphi^i}\), applying bilinearity of \(g\) we see that \[ \begin{align} g(X, Y) & = g\left( \sum_i X^i \fpd{}{\varphi^i}, \sum_j Y^j \fpd{}{\varphi^j}\right) \\ & = \sum_i \sum_j X^i Y^j g(\fpd{}{\varphi^i}, \fpd{}{\varphi^j}) \\ & = \sum_{i,j} g_{ij}X^i Y^j \end{align} \]
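To make the \(g_{ij}\) tangible, here's a sketch (my own example): the round metric on \(S^2\) in spherical coordinates \((\theta, \phi)\), obtained by pulling back the Euclidean inner product of \(\R^3\) along the embedding; the components come out to \(g = \mathrm{diag}(1, \sin^2\theta)\).

```python
import numpy as np

# Embedding of S^2 into R^3 in spherical coordinates (theta, phi).
def embed(theta, phi):
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

# g_ij = <d(embed)/dx^i, d(embed)/dx^j>, via a finite-difference Jacobian.
def metric(theta, phi, h=1e-6):
    d_theta = (embed(theta + h, phi) - embed(theta - h, phi)) / (2 * h)
    d_phi = (embed(theta, phi + h) - embed(theta, phi - h)) / (2 * h)
    J = np.stack([d_theta, d_phi], axis=1)   # 3 x 2 Jacobian of the embedding
    return J.T @ J                           # the 2 x 2 matrix of g_ij

print(metric(1.0, 0.5))                      # ~ diag(1, sin(1.0)^2)
```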
Musical isomorphisms
As a bilinear function on pairs of tangent vectors, the metric \(g\) at a point \(p\) is a \((0, 2)\)-tensor. If we fix a vector \(X \in \tpm\), we can consider the function \(g(X, \cdot) : \tpm \to \R\) resulting from fixing one argument of \(g\) at \(X.\) Since \(g\) is bilinear (linear in each of its arguments), \(g(X, \cdot)\) is a linear map. In other words, it corresponds to some \(\omega \in \tpsm\). In this sense, \(g\) gives rise to a map from \(\tpm\) to \(\tpsm.\) This is in fact an isomorphism, called a musical isomorphism. The above map is written \(\flat: \tpm \to \tpsm\), and its inverse is \(\sharp: \tpsm \to \tpm.\) (The musical notation derives from the notational convention of writing indices of vector components in the upper position and indices of covector components in lower position. The action of the musical isomorphisms is to bump these indices up and down, much like the action of the sharp and flat notation on musical notes.)
We can compute coordinate representations of the musical isomorphisms. Recall that the coordinates of a covector are recovered by letting it act on the chart-induced basis vectors \(\fpd{}{\varphi^i}\). Hence \[ \begin{align} \flat(X) & = \sum_i \flat(X)\left(\fpd{}{\varphi^i}\right) d\varphi^i \\ & = \sum_i g\left(X, \fpd{}{\varphi^i}\right) d\varphi^i \\ & = \sum_i g\left(\sum_j X^j \fpd{}{\varphi^j}, \fpd{}{\varphi^i}\right) d\varphi^i \\ & = \sum_{i, j} X^j g\left( \fpd{}{\varphi^j}, \fpd{}{\varphi^i}\right) d\varphi^i \\ & = \sum_{i, j} X^j g_{ji} d\varphi^i \end{align} \] So we can write \(\flat(X)\) in the chart-induced basis (exchanging the indices on \(g\) by symmetry) as \[ \flat(X) = \sum_{i, j} g_{ij} X^j d\varphi^i \]
In this way, in the chart-induced basis, we can think of the musical isomorphism \(\flat\) as equivalent to multiplying a vector's components by the matrix representation of the metric.
We can compute the components of the inverse, \(\sharp\), by noting that it should invert the \(\flat\) map above: \[ X = \sharp(\flat(X)) \] or, in the chart-induced basis, \[ \sum_k X^k \pd{}{\varphi^k} = \sharp(\sum_{i, j} g_{ij} X^j d\varphi^i). \] By linearity of \(\sharp\) we have \[ \sum_k X^k \pd{}{\varphi^k} = \sum_{i, j} g_{ij} X^j \sharp(d\varphi^i). \]
Now, we can expand each of the vectors \(\sharp(d\varphi^i)\) in terms of the basis vectors \(\fpd{}{\varphi^j}.\) Anticipating our end result, we will re-use the name \(g\) for the components, but with upper indices: \[ \sharp(d\varphi^i) = \sum_k g^{ik} \pd{}{\varphi^k} \] Plugging this into the above and then rearranging, we get \[ \begin{align} \sum_k X^k \pd{}{\varphi^k} & = \sum_{i, j} g_{ij} X^j \left( \sum_k g^{ik} \pd{}{\varphi^k} \right) \\ & = \sum_k \left( \sum_{i, j} g^{ik} g_{ij} X^j \right) \pd{}{\varphi^k}, \end{align} \] from which we can extract the identity \[ X^k = \sum_{i, j} g^{ik} g_{ij} X^j. \] From this we can see that the \(g^{ik}\) are the components of the matrix inverse of \(g_{ij}\). Now we can write \(\sharp(\omega)\) for arbitrary \(\omega \in \tpsm\) as \[ \begin{align} \sharp(\omega) & = \sharp\left(\sum_i \omega_i d\varphi^i \right) \\ & = \sum_i \omega_i \sharp(d\varphi^i) \\ & = \sum_i \omega_i \left(\sum_k g^{ik} \pd{}{\varphi^k} \right) \\ & = \sum_{i,k} g^{ik} \omega_i \pd{}{\varphi^k} \end{align} \]
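In matrix terms, then, \(\flat\) multiplies a component vector by the matrix \((g_{ij})\) and \(\sharp\) multiplies by its inverse \((g^{ij})\). A quick sketch with a made-up metric:

```python
import numpy as np

g = np.array([[2.0, 0.5],          # g_ij: a made-up symmetric
              [0.5, 1.0]])         # positive-definite metric at p
g_inv = np.linalg.inv(g)           # g^ij

X = np.array([1.0, -2.0])          # vector components X^i
omega = g @ X                      # flat(X): covector components omega_i
X_back = g_inv @ omega             # sharp(omega) recovers X

print(omega, X_back)               # X_back equals X up to round-off
```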
Gradient of a Smooth Function on \(\M\)
We're almost there! We now define the gradient covector of a smooth real-valued function \(f\) at a point \(p\) on a smooth manifold \(\M\) as the unique member of the cotangent space \(\tpsm\) whose value, when applied to a vector \(X \in \tpm\), is the directional derivative of \(f\) in the direction of \(X\) (note that this directional derivative is just a single real number). We denote it \(d_pf\): \[ d_pf : \tpm \to \R \\ d_pf(X) = X(f) = \td{}{t}f\big(\gx(t)\big)\bigg|_{t = 0} \] We should convince ourselves that this thing is linear, which we can see as follows. \[ \begin{align} d_pf(aX + bY) & = (aX + bY)(f) \\ & \: = aX(f) + bY(f) \\ & \: = a d_pf(X) + b d_pf(Y) \end{align} \]
In the chart-induced basis, we can write \[ \begin{align} d_pf(X) = X(f) & = \sum_i X^i \cdot \pd{f}{\varphi^i} \bigg|_{\varphi(p)} \\ & = \sum_i d\varphi^i(X) \cdot \pd{f}{\varphi^i} \bigg|_{\varphi(p)} \end{align} \] So that, irrespective of \(X\), we can write \[ d_pf = \sum_i \pd{f}{\varphi^i} \bigg|_{\varphi(p)} d\varphi^i \] We see, then, that the components of the gradient covector at \(p\), in the chart-induced co-basis, are precisely the partial derivatives of \(f\) (really, the partial derivatives of \(f \circ \varphi^{-1}\) with respect to the coordinates on \(\R^d\)).
Now, we can use tangent vectors at \(p\) to move infinitesimal distances in the direction given by the vector, simply by adding the vector components (in the chart-induced basis!) to the (chart-induced!) coordinates of \(p\). But we can't simply add the components of a covector to the coordinates of a point (well, we could but it would be meaningless). Instead, we apply our musical isomorphism to convert the gradient covector into a member of \(\tpm\), which gives us, at last, the method of natural gradient adaptation.
Natural Gradient Adaptation
We're now in a position to formally define the natural gradient method described in (Amari, 1998): given a real-valued function \(f : \M \to \R\) on a Riemannian manifold \((\M, g)\), the natural gradient at a point \(p\) is defined to be the vector \(\sharp(d_pf).\) But...why?
We saw that the gradient of \(f\) at \(p\) is the unique covector in \(\tpsm\) which maps each vector \(X \in \tpm\) to the directional derivative of \(f\) in the direction of \(X\) (this is a single real number). The natural gradient is the image of the gradient under the \(\sharp\) map. We can interpret this as the unique vector in \(\tpm\) whose (pseudo-)inner product with any vector \(X\) is the directional derivative in the direction of \(X.\) Well, it turns out this vector \(\sharp(d_pf)\) points in the direction of most rapid change of \(f\)!
Let's prove this. We wish to find a vector \(X\) in whose direction the derivative of \(f\) is extremized, subject to the constraint that the length of \(X\) be 1, i.e. \(g(X, X) = 1.\) Fix a chart \((U, \varphi)\) around \(p.\) In the chart-induced basis, we can write \(X = \sum_i X^i \fpd{}{\varphi^i}\) and \(g(X, X) = \sum_{ij} g_{ij} X^i X^j.\) We can write the directional derivative of \(f\) as \[ X(f) = \sum_i X^i \pd{f}{\varphi^i} \] Our goal is to find the \(X^i\)'s. To formulate the problem as a constrained optimization, we employ the method of Lagrange multipliers to write an objective function \[ \mathcal{J}(X^1, \ldots, X^d) = \sum_i X^i \pd{f}{\varphi^i} + \lambda \left( \sum_{i, j} g_{ij} X^i X^j - 1 \right) \] and set its derivative with respect to the \(X^i\)'s equal to zero. \[ \pd{\mathcal{J}}{X^i} = \pd{f}{\varphi^i} + 2 \lambda \left( \sum_j g_{ij} X^j \right) = 0 \] This is readily solved for the components of \(X\) in terms of the inverse metric matrix components as follows: \[ \sum_j g_{ij} X^j = -\dfrac{1}{2 \lambda}\pd{f}{\varphi^i} \\ \sum_i g^{ki} \sum_j g_{ij} X^j = -\dfrac{1}{2 \lambda} \sum_i g^{ki}\pd{f}{\varphi^i} \\ \sum_i \sum_j g^{ki} g_{ij} X^j = -\dfrac{1}{2 \lambda} \sum_i g^{ki}\pd{f}{\varphi^i} \\ \sum_j \delta^k_j X^j = -\dfrac{1}{2 \lambda} \sum_i g^{ki}\pd{f}{\varphi^i} \\ X^k = -\dfrac{1}{2 \lambda} \sum_i g^{ki} \pd{f}{\varphi^i} \] The value of \(\lambda\) comes from plugging back in the constraint, namely that the vector be of unit length. Putting this back in terms of the basis, and recalling the expression for the gradient of \(f\), we have \[ \sum_k X^k \pd{}{\varphi^k} = -\dfrac{1}{2 \lambda} \sum_{i,k} g^{ki} \pd{f}{\varphi^i} \pd{}{\varphi^k} \] where the sum on the right is precisely \(\sharp(d_pf)\): \[ \begin{align} \sharp(d_pf) & = \sharp \left( \sum_i \pd{f}{\varphi^i} d\varphi^i \right) \\ & = \sum_i \pd{f}{\varphi^i} \sharp (d\varphi^i) \\ & = \sum_{i,k} g^{ik} \pd{f}{\varphi^i} \pd{}{\varphi^k}. \end{align} \]
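To see the payoff numerically, here's a closing sketch of my own (not from Amari's paper): minimize squared distance to a point on \(\R^2 \setminus \{0\}\) in polar coordinates \((r, \theta)\), where the Euclidean metric has components \(g = \mathrm{diag}(1, r^2)\). The vanilla update subtracts the tuple of partials directly from the coordinates; the natural update first applies \(\sharp\), i.e., multiplies by \(g^{-1}\).

```python
import numpy as np

target = np.array([1.0, 0.0])                  # Cartesian target point

def f(p):                                      # squared distance to target,
    r, th = p                                  # expressed in polar coordinates
    xy = np.array([r * np.cos(th), r * np.sin(th)])
    return np.sum((xy - target) ** 2)

def partials(p, h=1e-6):                       # the tuple of partials of f
    return np.array([(f(p + h * e) - f(p - h * e)) / (2 * h)
                     for e in np.eye(2)])

for natural in (False, True):
    p = np.array([2.0, 1.0])                   # start at r = 2, theta = 1
    for _ in range(200):
        df = partials(p)
        if natural:                            # apply sharp: multiply by g^{-1}
            df = np.diag([1.0, 1.0 / p[0] ** 2]) @ df
        p = p - 0.05 * df                      # step against the gradient
    print("natural" if natural else "vanilla", p, f(p))
```

Both runs converge to \((r, \theta) \approx (1, 0)\) here, but note what the natural version does: it rescales the angular step by \(1/r^2\), exactly the correction the geometry of the coordinates demands.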
Conclusion
We have introduced many of the basic notions of Riemannian geometry; the definition of smooth manifold, tangent vector spaces and their dual spaces, metric tensors, gradient covectors and the natural gradient vector. We've shown that the natural gradient vector corresponds to the direction of most rapid change in a function on a Riemannian manifold. This has all been quite abstract, however, so in a follow-up post, I'll cover some concrete examples to help illustrate how this works in practice.
Acknowledgements
I started this post a really long time ago and just got around to actually publishing it. I really appreciate the early feedback I got from Richie Stebbing, Rif A. Saurous, and Alex Davies, as well as the encouragement of Aleks Goeva. Please feel free to hit me up with any feedback you may have, either in the comments, or via email/twitter (links below). Thanks for reading!