[{"content":"As research labs are beginning to phase out Vision Language Action (VLA) models in favor of the newer \u0026ldquo;world models\u0026rdquo;, this is a perfect time to look back on a field that has had some time to mature, and put together some notation that makes it easier to talk about the central concepts. Most of the technical content is borrowed from the papers by Physical Intelligence , some of it comes from my background in control theory.\nStates \u0026amp; Observations Let $\\mathscr S$ be the state space of a robot. The state at time $t$ is written as $S_t$, which is an $\\mathscr S$-valued random variable. For example, $S_t$ may be the robot\u0026rsquo;s joint angles at times $$ \\begin{array}{c@{\\,,\\ }c@{\\,,\\ }c@{\\,,\\ }c@{\\,,\\ }c@{}c} t-(\\mathrm n - 1)\\tau \u0026 \\cdots \\,\u0026 t-2\\tau \u0026 t-\\tau \u0026 t \\end{array} $$ where $\\tau$ is a timestep. In this case, $\\mathscr S=(SO(2))^{\\mathrm j\\mathrm n}$, $\\mathrm j$ is the number of joints, and $\\mathrm n$ is the number of timesteps. The state could also include the last $\\mathrm n$ image frames seen by the robot\u0026rsquo;s camera(s) and a natural language description of the task assigned to the robot. The $\\pi_{0.6}^\\ast$ paper refers to $S_t$ as the \u0026ldquo;observation\u0026rdquo;, but we will call it the \u0026ldquo;state\u0026rdquo; for simplicity.\nIn control theory, the state-transition map tells you how the state of a system evolves over time, subject to some state-dependent control inputs. We can formulate something similar in the stochastic setting using a stochastic differential equation (SDE) or its corresponding Fokker-Planck equation. Actions The goal of learning-based control is to learn a policy, which is a mapping that takes $S_t$ and outputs an action $A_t$. A policy enables the robot to interact with (and respond to changes in) its environment. 
We may expect our mapping from $S_t$ to $A_t$ to be deterministic; nevertheless, it proves more convenient to consider a probabilistic policy. Let the policy $\\pi$ be the conditional pdf $\\pi(a,s)\\coloneqq f_{A_t|S_t}(a|s)$ that tells you the probability density of taking the action $a$ given the state $s$. As the notation suggests, we have assumed that $\\pi$ does not explicitly depend on time (though one could always cheat and include $t$ itself as a variable in $S_t$).\nDiffusion and flow-matching policies became extremely fashionable in 2025. In fact, there exists a continuous spectrum of generative policies that interpolates between diffusion (SDE-based generation) and flow-matching (ODE generation with random initial condition). Generative policies learn how to generate samples from $\\pi(\\,\\cdot\\,,s)$ for each $s\\in\\mathscr S$, rather than learning the pdf $\\pi$ directly. While the generative model serves as the action head, the observations are passed through a vision encoder that goes into a VLM (vision-language model), which in turn steers the action head via something like FiLM.\nTrajectories I will use the following notation to represent a family of random variables indexed by time, called a stochastic process: $$ S_{[0,\\infty)}\\coloneq (S_t)_{t\\in[0,\\infty)} $$ The family $S_{[0,\\infty)}$ is assumed to satisfy a Markov property:1 $$ f_{S_{t+\\Delta t}| S_{[0,t]}}(s|{\\mathscr s}) = f_{S_{t+\\Delta t}|S_t}\\big(s|{\\mathscr s}(t)\\big), $$ where ${\\mathscr s}: [0,t] \\rightarrow \\mathscr S$ is a curve (i.e., a trajectory) in $\\mathscr S$. In practice, we can augment the state with additional information to ensure a Markov-like property.\nTake a second to digest this notation. 
The subscripts for \"$f$\" indicate that two different functions are used on either side. On the left, we have the conditional pdf of $S_{t+\\Delta t}$ (a random variable) conditioned on $S_{[0,t]}$ (also a random variable), and we are evaluating this pdf at $(s,{\\mathscr s})$. That is, $S_t$ is a random variable while $s$ represents an arbitrary point in $\\mathscr S$. Similarly, $S_{[0,t]}$ is a random variable and ${\\mathscr s}$ is a sample of this random variable. We can replace $[0,t]$ with $\\lbrace 0, 1, 2, \\ldots, {\\rm k}\\rbrace$ to recover the discrete-time formulation. For instance, a discrete-time trajectory may be viewed as a map $\\lbrace 0, 1, 2, \\ldots \\rbrace \\rightarrow \\mathscr S$.\nDatasets Given an underlying distribution over initial conditions (represented by $S_0$), a policy $\\pi$, and a stochastic dynamical model for the system (e.g., an underlying SDE), we get a combined stochastic process $(S_t,A_t)_{t\\in[0,\\infty)}$. This is the random state-action trajectory. Conversely, if we have a dataset of state-action trajectories, we can learn the $\\pi$ that would generate them; this is behavior cloning. We can also make small perturbations to $\\pi$ to see if the resulting trajectories improve upon some reward function (in expectation); this is reinforcement learning. Regularization can be introduced to ensure that the perturbed policy doesn\u0026rsquo;t deviate too far from some baseline policy (as done in TRPO and PPO). The regularization term is typically an information-theoretic divergence between $(S_t,A_t)_{t\\in[0,\\infty)}$ and the baseline policy; the divergence between two distributions measures how much they differ from each other.\nAction Chunking Do you (presumably a human) also operate on such a latency? In your case, $\\delta$ is perhaps the time delay between when your Team Fortress 2 enemy first appears on-screen to when you begin moving the reticle towards their head. It is impressively small! 
In practice, the roboticist may only be able to specify $f_{A_{t+\\delta}|S_t}(a|s)$ due to latency issues! Since doing model inference takes a $\\delta \u003e0$ amount of time, it is impossible to act at time $t$ based on the information at $t$.2 Inference is expensive; if we ran inference on a VLA model at each timestep, then the frequency of our model\u0026rsquo;s outputs would be limited by how long inference takes. The (currently) most well-known workaround is to ensure that each inference call produces a chunk of actions, i.e., a sequence of actions to be executed over the next $\\mathrm h_{\\text{pr}}$ timesteps, where $\\mathrm h_{\\text{pr}}$ is called the prediction horizon. This is action chunking. In practice, it\u0026rsquo;s common to execute only $\\mathrm h_{\\text{ex}} \u003c \\mathrm h_{\\text{pr}}$ actions from the chunk before requesting a new chunk of actions from the model, where $\\mathrm h_{\\text{ex}} \\approx \\mathrm h_{\\text{pr}}/2$.3 We call $\\mathrm h_{\\text{ex}}$ the execution horizon.\nHowever, what happens at the end of the chunk? There are two issues here:\nAfter executing a chunk, we still need to wait $\\delta$ milliseconds to get the next chunk of actions. Our robot hasn\u0026rsquo;t been told what to do during this time; if $\\delta \u003e \\tau$, then there will be a noticeable pause in the robot\u0026rsquo;s motion.\nOur VLA model doesn\u0026rsquo;t remember what it spat out during the last chunk, so the next chunk it produces will not necessarily join the previous chunk continuously or smoothly.\nIn my opinion, these aren\u0026rsquo;t very serious issues; as long as you have a robust VLA model you can tell your robot to do the dishes and go to bed \u0026ndash; the robot will do the dishes overnight (albeit a little more loudly than you\u0026rsquo;d like). 
These are, however, very serious issues at robotics labs like Physical Intelligence because they make for some seriously unimpressive demos.\nThe idea of real-time chunking (RTC) is to do the following. Let $\\mathrm d \\coloneq \\lfloor \\delta / \\tau \\rfloor$ be the number of timesteps that the model takes to do its inference, i.e., the inference delay. Suppose we already have this sequence of actions at timestep $t$: $$ \\begin{array}{ccccc} a_{t} \u0026 a_{t + \\tau} \u0026 a_{t + 2\\tau} \u0026 \\cdots \u0026 a_{t + (\\mathrm h_{\\text{pr}}-1) \\tau} \\end{array} $$ Then, in two parallel threads, do the following (assuming $\\mathrm h_{\\text{pr}}\\geq \\mathrm h_{\\text{ex}} + \\mathrm d$):\nexecute the first $\\mathrm h_{\\text{ex}}$ actions in the chunk;\nafter $a_{t + (\\mathrm h_{\\text{ex}}-1) \\tau}$ is executed, initiate another call for inference; condition this inference on the next $\\mathrm d$ actions in the chunk: $$ \\begin{array}{cccc} a_{t+\\mathrm h_{\\text{ex}} \\tau} \u0026 a_{t +(\\mathrm h_{\\text{ex}} + 1)\\tau} \u0026 \\cdots \u0026 a_{t + (\\mathrm h_{\\text{ex}}+\\mathrm d - 1) \\tau} \\end{array} $$ as these actions will be executed while the inference is happening!\nThe conditioning step can be viewed as freezing the next $\\mathrm d$ actions of the previous chunk and inpainting the rest of the action chunk, using the same algorithms as those used for image inpainting. This way, we have a new chunk of actions waiting for us at time $t+(\\mathrm h_{\\text{ex}}+\\mathrm d)\\tau$, and this new chunk will (due to the inpainting) smoothly continue from the previous chunk. The robot thinks as it acts.\nThe paper I linked above is actually better-described as inference-time RTC. It uses a gradient-based inpainting method to fill in the missing actions. 
Basically, the score (or velocity) vector in the diffusion (or flow-matching) process incurs an additional term calculated via a vector-Jacobian product, ensuring that the denoised (or flow-matched) chunk begins with the $\\mathrm d$ frozen actions. So \u0026ldquo;conditioning\u0026rdquo; here refers to the gradient-based correction term that is used to inpaint the prefix of the action chunk.\nTraining-time RTC is concerned with the problem that the gradient-based inpainting is expensive. Their solution is to simply not denoise the first $\\mathrm d$ actions during inference. So, the adjectives inference-time vs. training-time are a bit misleading; both algorithms require some inference-time modifications over the vanilla diffusion (or flow matching) VLA policy. While IT-RTC doesn\u0026rsquo;t require any training-time changes, TT-RTC may need small architectural and training-time changes. That said, the main difference between IT-RTC and TT-RTC is in how they hold the frozen actions in place during inference.\nRewards \u0026amp; Value Functions Given a state-action pair $(s,a)$, let the pdf $f_{S_{t+\\tau}|S_t,A_t}(s'|s,a)$ be the state-transition function \u0026ndash; it gives the probability density of ending up at $s'$ in the next time instant. Assume for simplicity that our system is time-invariant, i.e., $f\\coloneq f_{S_{t+\\tau}|S_t,A_t}$ does not depend on $t$.4\nWe can now introduce well-known definitions from reinforcement learning:\nThe reward function is a map $R:\\mathscr S\\times \\mathscr A \\rightarrow \\mathbb R$, where $\\mathscr A$ is the action space. The value function $V^\\pi$ is the expected total (discounted) reward when following policy $\\pi$. It satisfies $$V^\\pi(s)=\\mathbb E_{a\\sim\\pi(\\,\\cdot\\,|s)}\\left[R(s,a)+\\gamma \\,R'(s,a)\\right]$$ where $R'$ is the \u0026ldquo;expected future reward\u0026rdquo; $$R'(s,a)\\coloneqq\\mathbb E_{s'\\sim f(\\,\\cdot\\,|s,a)}\\left[V^\\pi(s')\\right]$$and $\\gamma\\in(0,1]$ is a discount factor. 
(This can work as a definition of $V^\\pi$; it is stated as an implicit/recursive definition for convenience.) The Q-function is the thing inside the expectation in the definition of $V^\\pi$, so we have $$V^\\pi(s)=\\mathbb E_{a\\sim\\pi(\\,\\cdot\\,|s)}\\left[Q^\\pi(s,a)\\right]$$ Note that $R$ satisfies $$R(s,a)=Q^\\pi(s,a)-\\gamma\\mathbb E_{s'\\sim f}\\left[V^\\pi(s')\\right]$$ but it doesn\u0026rsquo;t actually depend on $\\pi$, only the right-hand side does! The advantage function is $$\\begin{align}A^\\pi(s,a)\u0026=Q^\\pi(s,a)-V^\\pi(s)\\\\\u0026=Q^\\pi(s,a)-\\mathbb E_{a\\sim\\pi}\\big[Q^\\pi(s,a)\\big]\\end{align}$$ In a typical VLA pipeline, the policy is first learned via behavior cloning on a large dataset of demonstrations, then fine-tuned using RL (e.g., PPO) to improve task success rates. Chris Paxton has a nice writeup on the role of reinforcement learning in the age of behavior cloning.\nMy notation here implies the existence of a pdf $f_{S_{[0,t]}}({\\mathscr s})$ that should somehow \u0026ldquo;integrate to $1$\u0026rdquo;. What is its domain? To simplify things, we can write the trajectory-space as $\\mathcal C([0, t],\\mathscr S)$, assuming without much loss of generality that the trajectory ${\\mathscr s}\\in \\mathcal C([0, t],\\mathscr S)$ is continuous. Another possibility is to consider the trivial fiber bundle $\\mathscr S \\times \\mathbb R\\rightarrow \\mathbb R$. The space of trajectories is then the space of sections of this bundle. In the discrete-time case, we replace $\\mathbb R$ with $\\mathbb Z_{+}$.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nIn flow-based generation, we have to do 5 or more forward passes to get the denoised action chunk. Even OpenVLA (which I do not believe is flow-based) has $\\delta \\approx 320\\,ms$ after accounting for model inference, network latency, and other overheads. 
The VLM backbone is typically beefier than the action head, so a bulk of the inference overhead sits there.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nIf only $\\mathrm h_{\\text{ex}} \u003c \\mathrm h_{\\text{pr}}$ actions are executed, then why do we predict the remaining $\\mathrm h_{\\text{pr}} - \\mathrm h_{\\text{ex}}$ actions at all? My understanding is that this encourages the model to think long-term (a la model-predictive control) rather than resorting to a greedy policy.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nIf $\\pi^\\ast$ is a value-maximizing policy, it will satisfy $\\pi^\\ast(a|s)=\\delta(a-a^\\ast(s))$, where $\\delta$ is the Dirac delta function and $a^\\ast(s)=\\arg\\max_{a\\in\\mathscr A}Q(s,a)$. So, some of the expectations in the above definitions will turn into $\\max$es.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/rl/","summary":"A review of the basic concepts that go into the design and training of VLA models.","title":"VLA Models"},{"content":"I\u0026rsquo;ve been going through Russ Tedrake\u0026rsquo;s notes on robotics, which got me thinking about their so-called monogram notation. Basically, this is the notation that represents an $SE(3)$ transformation as $^aX^b$ — more on that shortly. There is a recent review paper that points to several variations on this notation, and my postdoc advisor Gregory Chirikjian wrote a paper on it as well. Tedrake\u0026rsquo;s monogram notation looks very elegant until you get to spatial velocities, at which point strange artifacts such as the \u0026ldquo;cross product\u0026rdquo; will appear. $\\newcommand{\\mf}[1]{\\mathbf{#1}}$ $\\newcommand{\\mono}[4]{ { ^{#2} #1 ^{#3} _{#4} } }$\nThis got me thinking about the connections between the monogram notation and abstract Lie groups. The result of this deliberation was a new notation for spatial velocities that bridges the gap between abstract Lie groups and their applications to robotics and computer graphics. 
The notation I present here has the following characteristics:\nit is consistent with the monogram notation (with a minor caveat, noted in the appendix),\nit makes the transformation rules for spatial velocities look like those of points, vectors, and frames, and\n(for those familiar with abstract Lie groups) it emphasizes the role of the adjoint representation of $SE(3)$.\nChapter 1. The Frame Bundle First, let\u0026rsquo;s think about the space in which frames live. Given a point $\\mf p^f\\in \\mathbb R^3$, a frame $f$ at $\\mf p^f$ is an ordered set of $3$ linearly independent vectors $(\\mf f_1, \\mf f_2, \\mf f_3)$ along with the base point $\\mf p^f$. The frame is visualized as arrows starting at $\\mf p^f$ and facing outwards.1 The collection of all frames of $\\mathbb R^3$ makes up the frame bundle $\\mathrm F\\mathbb R^3$. There is a map $\\pi:\\mathrm F\\mathbb R^3\\rightarrow \\mathbb R^3$ that maps $f$ to its base point $\\mf p^f$, giving us a fiber bundle structure.\nThe contents of the frame $f\\in\\mathrm F\\mathbb R^3$ can be stacked into a matrix as follows: $$ f=\\begin{bmatrix} \\mf f_1 \u0026 \\mf f_2 \u0026 \\mf f_3 \u0026 \\mf p^f\\\\ 0 \u0026 0 \u0026 0 \u0026 1 \\end{bmatrix}. $$ Meanwhile, an element $(A, \\mf p)$ of the semi-direct product group $G\\coloneq GL(3)\\ltimes\\mathbb R^3$2 can be written as\n$$ (A, \\mf p) = \\begin{bmatrix} A \u0026 \\mf p\\\\ \\mf 0 \u0026 1 \\end{bmatrix}. $$ The transformation $(A, \\mf p)$ acts on $f$ from the right, giving us another frame that we denote as $g$: $$ \\begin{align} f \\cdot (A, \\mf p) \u0026= \\begin{bmatrix} \\mf f_1 \u0026 \\mf f_2 \u0026 \\mf f_3 \u0026 \\mf p^f\\\\ 0 \u0026 0 \u0026 0 \u0026 1 \\end{bmatrix}\\begin{bmatrix} A \u0026 \\mf p\\\\ \\mf 0 \u0026 1 \\end{bmatrix}\\\\ \u0026\\eqcolon \\begin{bmatrix} \\mf g_1 \u0026 \\mf g_2 \u0026 \\mf g_3 \u0026 \\mf p^g\\\\ 0 \u0026 0 \u0026 0 \u0026 1 \\end{bmatrix}\\\\ \u0026=g. 
\\end{align} $$Conversely, given a pair of frames $f,g$, there is a unique transformation ${\\mono Xfg~}\\coloneq(\\mono Afg~, \\mono {\\mf p}fgf)$ in $G$ that takes $f$ to $g$. Equivalently, ${^fX^g}$ is the frame $g$ as seen from $f$. As we will see, this notation lets us write the transformation rule for frames as $\\mono Xeg~ = \\mono Xef~ \\mono Xfg~$.\n${^fX^g}$ is the transformation that takes $f$ to $g$\nequivalently, it's $g$ \"as seen from\" $f$ The action of $G$ is transitive and free (it doesn't really matter what these words mean, but I have included a handy guide in the appendix). In particular, $SE(3)=SO(3)\\ltimes \\mathbb R^3$ is a subgroup of $G$ that acts on $\\mathrm F\\mathbb R^3$, and it leaves the subset $\\mathrm O\\mathbb R^3\\subseteq \\mathrm F\\mathbb R^3$ of positively-oriented orthogonal frames invariant! We will later specialize our discussion to the action of $SE(3)$ on $\\mathrm O\\mathbb R^3$. Chapter 2. Transformation Rules In the last chapter, we implicitly assumed that there is a well-defined \u0026ldquo;origin frame\u0026rdquo; that we will refer to as $e$. This is the frame that, when represented as a $4\\times 4$ matrix, becomes the identity matrix. For our purposes, we can assume that this is a frame that is stationary with respect to Earth 🌏 and located at a convenient location. Perhaps $e$ is located at (and aligned with) the lower-left corner of my desk. Let $f$ be another frame that is attached to the fan on my ceiling.\nWe can describe the vector that goes from $e$ to $f$ as $\\mono {\\mf p}{e}{f}{~}$. 
Expressing this vector in the basis of $f$, we have $$ \\mono {\\mf p}{e}{f}{~} = \\mono {\\mf p}{e}{f}{\\mf f_1} \\mf f_1 + \\mono {\\mf p}{e}{f}{\\mf f_2} \\mf f_2 + \\mono {\\mf p}{e}{f}{\\mf f_3} \\mf f_3, $$ whose coefficients we can represent as a vector: $$ \\mono {\\mf p}{e}{f}{f}\\coloneq\\begin{bmatrix} \\mono {\\mf p}{e}{f}{\\mf f_1}\\\\ \\mono {\\mf p}{e}{f}{\\mf f_2}\\\\ \\mono {\\mf p}{e}{f}{\\mf f_3} \\end{bmatrix}. $$ We could also represent $\\mono {\\mf p}{e}{f}{~}$ in the origin (i.e., Earth) frame: $$ \\begin{align} \\mono {\\mf p}{e}{f}{~} \u0026= \\mono {\\mf p}{e}{f}{\\mf e_1} \\mf e_1 + \\mono {\\mf p}{e}{f}{\\mf e_2} \\mf e_2 + \\mono {\\mf p}{e}{f}{\\mf e_3} \\mf e_3\\\\ \u0026= \\begin{bmatrix} \\mono {\\mf p}{e}{f}{\\mf e_1}\\\\ \\mono {\\mf p}{e}{f}{\\mf e_2}\\\\ \\mono {\\mf p}{e}{f}{\\mf e_3} \\end{bmatrix} \\ \\\\ \u0026\\eqcolon \\mono {\\mf p}{e}{f}{e}. \\end{align} $$ As in Tedrake\u0026rsquo;s notes, if we omit a superscript/subscript, we mean the origin (or Earth) frame: $\\mono {\\mf p}{}{f}{}\\coloneq\\mono {\\mf p}{e}{f}{e}$. This means that $\\mono {\\mf p}{}{}{f}$ is the zero vector, and $\\mono X{}{}{}$ the identity matrix!\nNotice that $f$ has the same matrix as $\\mono X{}{f}{}$. We can view $f$ both as a frame and as a transformation from $e$ to $f$. Similarly, $\\mf p^f$ can be viewed both as a point in $\\mathbb R^3$ and as the vector going from the origin to $\\mf p^f$. The underlying reason is that, in order to describe to you which point (or frame) I am talking about, I need to first choose an origin (or Earth frame).3\nWe have the following identities for frames $f,g,h,$ and $k$.\n(Vector Addition) $\\mono {\\mf p}{f}{h}{k} = \\mono {\\mf p}{f}{g}{k} + \\mono {\\mf p}{g}{h}{k}$\n(Additive Inverse) $\\mono {\\mf p}{f}{g}{k} = - \\mono {\\mf p}{g}{f}{k}$\n(Inverse Transformation) $\\mono X{f}{g}{} = (\\mono X{g}{f}{})^{-1}$ Using these facts, we can derive the transformation rule for points. 
Let\u0026rsquo;s relate how the frames $e$ and $f$ see the point $\\mf p^g$: $$ \\begin{align} \\mono {\\mf p}~g~ \u0026= \\mono {\\mf p}~f~ + \\mono {\\mf p}fg e\\\\ \u0026= \\mono {\\mf p}~f~ + \\mono {\\mf p}fg{\\mf f_1}\\mf f_1 + \\mono {\\mf p}fg{\\mf f_2}\\mf f_2 + \\mono {\\mf p}fg{\\mf f_3}\\mf f_3\\\\ \u0026= \\mono {\\mf p}~f~ + \\mono{A}~f~ \\mono {\\mf p}fg{f}, \\end{align} $$ $$ \\text{or}\\quad \\begin{bmatrix} \\mono {\\mf p}e g e\\\\ 1 \\end{bmatrix} = \\mono Xef~ \\begin{bmatrix} \\mono {\\mf p}fg{f}\\\\ 1 \\end{bmatrix} $$where $\\mono X~f~=(\\mono{A}~f~, \\mono {\\mf p}~f~)$ and $\\mono{A}~f~\\in GL(3)$ is the matrix $$ \\mono{A}~f~ = \\begin{bmatrix} \\mf f_1 \u0026 \\mf f_2 \u0026 \\mf f_3 \\end{bmatrix}. $$ What we wrote above is just the last column of the formula\u0026hellip;\n(Composition) $\\mono Xeg~ = \\mono Xef~ \\mono Xfg~$ While $\\mono Xef~$ translates between how $e$ and $f$ each see the frame $g$, we may be interested in translating between how $e$ and $f$ see the vector going from $f$ to $g$, which is $\\mono {\\mf p}fg~=\\mono {\\mf p}~g~ - \\mono {\\mf p}~f~$. We already know what this looks like: $$ \\mono {\\mf p}fge = \\mono{A}ef~\\mono {\\mf p}fgf. $$When $e$ and $f$ are both positively-oriented orthogonal frames, $\\mono{A}ef~$ is an orthogonal matrix, so we will write it as $\\mono{R}ef~$ instead:\n(Rotation-of-Basis) $\\mono {\\mf p}fge = \\mono{R}ef~\\mono {\\mf p}fgf$ In this case, $\\mono Xef~=(\\mono{R}ef~, \\mono {\\mf p}efe)\\in SE(3)$. These formulae hold if we replace $e$ with $h$; I just used $e$ to simplify the presentation.\nChapter 3. Velocities Suppose we have curves $f :\\mathbb R\\rightarrow \\mathrm F\\mathbb R^3$ and $g :\\mathbb R\\rightarrow \\mathrm F\\mathbb R^3$, such that $f(t)$ and $g(t)$ are moving w.r.t. the origin frame $e$. Our task is to describe their velocities at time $t$.\nSetting up the notation. 
I introduce the following notation for derivatives:\n$$ \\begin{align} (\\mono X{f}{g}~)' \u0026\\coloneq \\frac{d}{d\\tau}{\\mono X{f(t+\\tau)}{g(t+\\tau)}~}\\Big\\vert_{\\tau=0}\\\\ \\mono X{f}{g'}~ \u0026\\coloneq \\frac{d}{d\\tau}{\\mono X{f(t)}{g(t+\\tau)}~}\\Big\\vert_{\\tau=0}\\\\ \\mono X{f'}{g}~ \u0026\\coloneq \\frac{d}{d\\tau}{\\mono X{f(t+\\tau)}{g(t)}~}\\Big\\vert_{\\tau=0} \\end{align} $$which means that $(\\mono Xgg~)' = \\mf 0$, and so on. Each of these quantities is the component-wise derivative of a time-dependent matrix. By differentiating the identity $$ \\mono X{f(t)}{g(t+\\tau)}~ \\mono X{g(t+\\tau)}{f(t)}~ = \\mf I, $$we get $\\mono Xf{g'}~=-\\mono Xfg~\\,\\mono X{g'}f~\\,\\mono Xfg~$. In particular, $\\mono Xg{g'}~=-\\mono X{g'}g~$. I will not be using this notation much, but I found it to be very useful for deriving the forthcoming results.\nBasic physics tells us that, in order to even define what velocity is, we need to decide which frame we consider to be stationary. We will write $\\mono{\\Lambda}fgk$ to denote the velocity of $g$ as seen from $k$, with $f$ considered to be the stationary/reference frame. It is of the form $$ \\mono{\\Lambda}fgk = \\begin{bmatrix} \\mono{\\Omega}fgk \u0026 \\mono{\\mf v}fgk\\\\ \\mf 0 \u0026 0 \\end{bmatrix}\\in \\mathfrak{se}(3), $$ where $\\mono{\\Omega}fgk\\in\\mathfrak{so}(3)$ is a $3\\times 3$ skew-symmetric matrix. As we will see, $\\mono{\\mf v}fgk$ is not the same as Tedrake\u0026rsquo;s $\\mono{\\mf v}{f}{g}{k}$; what I write as $\\mono{\\mf v}fgk$ is what Tedrake writes as $\\mono{\\mf v}{f}{g_f}{k}$. I will justify my choice of notation shortly. Using the notation. We denote the velocity of $g$ as seen from itself, with $e$ considered as the stationary frame, as\n(Left-Invariant Velocity) $ \\mono{\\Lambda}{}gg\\coloneq \\mono Xg{g'}~ $ It follows that $\\mono X{g'}g~ = - \\mono{\\Lambda}{}gg$. 
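To make this concrete, here is a small worked example (a sketch; the spinning motion and the rate $\\omega_0$ are assumptions made up for illustration, not something taken from Tedrake\u0026rsquo;s notes). Suppose $g$ spins about the third axis of $e$ at a constant rate $\\omega_0$ while its base point stays at the origin, so that $\\mono X{}{g(t)}{}=(R_z(\\omega_0 t), \\mf 0)$, where $R_z(\\theta)$ denotes the rotation by $\\theta$ about the third axis. Then the rotation part of $\\mono X{g(t)}{g(t+\\tau)}~$ is $R_z(\\omega_0 t)^{-1}R_z(\\omega_0(t+\\tau)) = R_z(\\omega_0\\tau)$ and its translation part is zero, so differentiating at $\\tau=0$ gives $$ \\mono{\\Lambda}{}gg = \\begin{bmatrix} 0 \u0026 -\\omega_0 \u0026 0 \u0026 0\\\\ \\omega_0 \u0026 0 \u0026 0 \u0026 0\\\\ 0 \u0026 0 \u0026 0 \u0026 0\\\\ 0 \u0026 0 \u0026 0 \u0026 0 \\end{bmatrix}, $$ which does not depend on $t$: the frame $g$ sees itself spinning about its own third axis at the rate $\\omega_0$, with zero translational velocity. 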
Similar to how left-invariant vector fields work on abstract Lie groups, we have $$ \\begin{align} \\mono X~{g'}~=\\frac{d}{d\\tau}{\\mono X{e}{g(t)}~ \\mono X{g(t)}{g(t+\\tau)}~}\\Big\\vert_{\\tau=0} = \\mono X{}g~ \\mono{\\Lambda}{}gg. \\end{align} $$Thus, $\\mono{\\Lambda}{}gg=(\\mono X{}g~)^{-1}\\,\\mono X{}{g'}~$ is indeed the left-invariant velocity of $g$, with $e$ considered to be the stationary/reference frame. What is then the right-invariant velocity? It is the velocity of $g$ as seen from $e$, with $e$ considered the stationary frame:\n$$ \\mono{\\Lambda}ege\\coloneq \\mono X{e}{g}~ \\mono{\\Lambda}egg \\,\\mono Xge~. $$ The left-invariant velocity is the velocity of $g$ as seen from $g$, while the right-invariant velocity is the velocity of $g$ as seen from $e$. I have been cheeky in choosing $e$ and $g$ as the notation for these frames, since I want $e$ to remind you of the identity element of $SE(3)$! More generally, the adjoint representation gives us the velocity as seen from another frame. Recalling the definition of the adjoint representation , $\\mathrm{Ad}_X:\\mathfrak{se}(3)\\rightarrow\\mathfrak{se}(3)$ for each $X\\in SE(3)$, we have\n(Frame-Change for Velocities) $$ \\begin{align} \\mono{\\Lambda}ghe \u0026\\coloneq \\mono Xef~ \\,\\mono{\\Lambda}ghf \\,\\mono Xfe~\\\\ \u0026=\\mathrm{Ad}_{\\mono X{e}{f}{}}(\\mono{\\Lambda}ghf) \\end{align} $$ The nice symmetric notation helps us remember this formula, although I like the $\\mathrm{Ad}$ notation since it makes the transformation rules look like those of vectors and frames.\nHere is the big payoff of this notation. Velocities add like vectors, as long as they are expressed in the same frame:\n(Addition of Velocities) $ \\mono{\\Lambda}fhk = \\mono{\\Lambda}fgk + \\mono{\\Lambda}ghk $ You are encouraged to prove this! As a hint, notice that it is sufficient to prove this for the case of $k=f$. 
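If you would like to check your work, here is one possible route (a sketch only, under the assumed reading that $\\mono{\\Lambda}fgf = (\\mono Xfg~)'\\,\\mono Xgf~$ when both $f$ and $g$ are moving; this is not stated explicitly above, though it reduces to the right-invariant velocity when $f=e$). Differentiate the composition rule $\\mono Xfh~ = \\mono Xfg~\\,\\mono Xgh~$ and multiply on the right by $\\mono Xhf~ = \\mono Xhg~\\,\\mono Xgf~$: $$ (\\mono Xfh~)'\\,\\mono Xhf~ = (\\mono Xfg~)'\\,\\mono Xgf~ + \\mono Xfg~\\,\\big[(\\mono Xgh~)'\\,\\mono Xhg~\\big]\\,\\mono Xgf~. $$ Under the assumed reading, the left-hand side is $\\mono{\\Lambda}fhf$, the first term on the right is $\\mono{\\Lambda}fgf$, and the bracketed factor is $\\mono{\\Lambda}ghg$, so the last term is $\\mathrm{Ad}_{\\mono X{f}{g}{}}(\\mono{\\Lambda}ghg)=\\mono{\\Lambda}ghf$ by the frame-change rule. Conjugating both sides by $\\mono Xkf~$ then gives the statement for a general $k$. 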
As an example of how to apply this formula, you can verify that\n$$ \\mono{\\Lambda}{f}{h}{h} = \\mono{\\Lambda}{}{h}{h} - \\mathrm{Ad}_{\\mono X{h}{f}{}}(\\mono{\\Lambda}{}{f}{f}), $$ or the left-invariant velocity of $\\mono X{f}{h}{}$ is the velocity of $h$ minus the velocity of $f$ (as seen from $h$).\nChapter 4. Hat \u0026amp; Vee Maps For an introduction to the hat and vee maps, see the books by Chirikjian or Barfoot, or Appendix A of my preprint . Basically, $\\mathfrak{so}(3)$ has a certain choice of basis that lets us map skew-symmetric matrices to vectors in $\\mathbb R^3$ using the vee map:\n$$ \\begin{align} \\begin{bmatrix} 0 \u0026 -\\omega_z \u0026 \\omega_y\\\\ \\omega_z \u0026 0 \u0026 -\\omega_x \\\\ -\\omega_y \u0026 \\omega_x \u0026 0 \\end{bmatrix}^\\vee = \\begin{bmatrix} \\omega_x\\\\ \\omega_y\\\\ \\omega_z \\end{bmatrix}, \\end{align} $$ and the hat map is its inverse, so that $\\mono{\\Omega}efg^\\vee = \\mono{\\boldsymbol\\omega}efg$ and $\\mono{\\boldsymbol\\omega}efg^\\wedge = \\mono{\\Omega}efg$. The reason for this peculiar choice of basis is that it defines a Lie algebra isomorphism from $\\mathfrak{so}(3)$ to $(\\mathbb R^3, \\times)$, where $\\times$ is the cross product:\n$$ \\begin{align} \\Omega_1 \\Omega_2 - \\Omega_2 \\Omega_1 = ({\\boldsymbol\\omega}_1 \\times {\\boldsymbol\\omega}_2)^\\wedge \\end{align} $$and has interesting properties, such as $(\\boldsymbol\\omega^\\wedge) \\mf p = {\\boldsymbol\\omega}\\times \\mf p$. We can extend this to define hat and vee maps for $\\mathfrak{se}(3)$: $$ (\\mono{\\Lambda}{f}{g}{k})^\\vee=\\begin{bmatrix} \\mono{\\boldsymbol\\omega}{f}{g}{k}\\\\ \\mono{\\mf v}{f}{g}{k} \\end{bmatrix} $$ which is the vector that Tedrake writes as $\\mono{\\mathrm V}{f}{g_f}{k}$ (recall that my $\\mono{\\mf v}{f}{g}{k}$ is Tedrake\u0026rsquo;s $\\mono{\\mf v}{f}{g_f}{k}$). 
This vee map satisfies the following property; basically the choice of basis for $\\mathfrak{se}(3)$ lets us write the adjoint action as a matrix:\n$$ \\begin{align} \\textrm{Ad}_{\\mono Xhg~}(\\mono \\Lambda efg) = \\left( \\begin{bmatrix} \\mono Rhg~ \u0026 \\mf 0 \\\\ (\\mono {\\mf p}hgh)^{\\hspace{-2pt}^\\wedge} \\,\\mono Rhg~ \u0026 \\mono Rhg~ \\end{bmatrix} \\begin{bmatrix} \\mono {\\boldsymbol\\omega}efg \\\\ \\mono {\\mf v}efg \\end{bmatrix} \\right)^\\wedge \\end{align} $$ Yes, that's a weird-looking matrix, and it is not clear why we are hitting $\\mono {\\mf p}hgh$ with the hat map. But it works! Using the cross-product properties given above, we get the formulae $$ \\begin{align} \\mono {\\boldsymbol\\omega}efh \u0026= \\mono Rhg~ \\mono {\\boldsymbol\\omega}efg\\\\ \\mono {\\mf v}efh \u0026= \\mono {\\mf p}hgh\\times (\\mono Rhg~ \\mono {\\boldsymbol\\omega}efg) + \\mono Rhg~ \\mono {\\mf v}efg \\\\ \u0026=\\mono {\\mf p}hgh\\times \\mono {\\boldsymbol\\omega}efh + \\mono Rhg~ \\mono {\\mf v}efg . \\end{align} $$This is where our notation departs from that of Tedrake. In our notation, the addition of velocities formula looks like\n$$ \\begin{align} \\mono {\\boldsymbol\\omega}fhk \u0026= \\mono {\\boldsymbol\\omega}fgk + \\mono {\\boldsymbol\\omega}ghk\\\\ \\mono {\\mf v}fhk \u0026= \\mono {\\mf v}fgk + \\mono {\\mf v}ghk, \\end{align} $$both angular and translational velocities add when they are expressed in the same frame.\nAppendix A. The Two Definitions This begs the question of what the term $\\mono {\\mf v}ABC$ even means, in Tedrake\u0026rsquo;s notation. In his notation, what I am claiming is that $ \\mono{\\mf v}{A}{C_A}A = \\mono{\\mf v}{A}{B_A}A + \\mono{\\mf v}{B}{C_B}A $, which is true! In other words, my $\\mono{\\mf v}{A}{B}A$ is what Tedrake writes as $\\mono{\\mf v}{A}{B_A}A$. 
Note that $\\mono{\\boldsymbol\\omega}ABA=\\mono{\\boldsymbol\\omega}A{B_A}A$ in both of our notations, but\n$\\mono{\\mf v}A{B}A\\,$$= \\mono{\\mf v}A{B_A}A + \\mono{\\boldsymbol\\omega}A{B}A \\times \\mono{\\mf p}A{B}A $. You can draw some diagrams to see that his choice for representing velocities is very reasonable from a physical standpoint. However, the notation I choose here is more compatible with the group structure of $SE(3)$, and makes the transformation rules look more natural.\nAppendix B. Group Actions An action of $G$ on $X$ is said to be\u0026hellip;\neffective (or faithful) if $g\\cdot x=x$ for all $x \\in X$ implies $g=e$. Equivalently, the mapping $G\\mapsto \\textrm{Aut}(X)$ is injective. “Everyone (in $G$) is doing something somewhere (in $X$)\" free if $g\\cdot x=x$ for some $x$ already implies that $g=e$. Given an element $g\\in G\\backslash\\lbrace e\\rbrace$, its action has no fixed points (is fixed-point free!) “Everyone's doing something everywhere\" transitive if any two points are connected by a group action. Consider the task of taking $x_1\\in X$ to $x_2\\in X$\u0026hellip; (Given an arbitrary task) “It will be done by someone\" regular if transitive and free. This makes $X$ a principal homogeneous $G$-space or a $G$-torsor ; the group element in the definition of transitivity is unique (Given an arbitrary task) “I know just the person!\" It turns out that such a frame can have one of two parities — it can either be positively or negatively oriented. 
Our usual choice of the $\\mathrm{xyz}$ axes is positively oriented, so you can use that fact to come up with your own \u0026ldquo;right-hand rule\u0026rdquo; for positive orientation.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThink of this as the group of homogeneous transformations $SE(3)$, but with the rotation part generalized to an arbitrary $3\\times 3$ invertible matrix.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nAbstractly speaking, a group has a distinguished point (the identity), while on a homogeneous space we need to choose a reference point. In the context of my previous post , the Earth frame $e$ can be seen as a choice of a distinguished point in $\\mathrm F\\mathbb R^3$. It identifies $\\mathrm F\\mathbb R^3$ with $GL(3)\\ltimes \\mathbb R^3$. Similarly, we can identify $\\mathrm O\\mathbb R^3$ with $SE(3)$; given a frame $f\\in\\mathrm O\\mathbb R^3$, we associate it to the $SE(3)$ transformation $\\mono X~f~ = \\mono Xefe$. In other words, $\\mathrm O\\mathbb R^3$ is an $SE(3)$-torsor.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/robotics/","summary":"I\u0026rsquo;ve been going through Russ Tedrake\u0026rsquo;s notes on robotics, which got me thinking about their so-called monogram notation. The result of this deliberation was a new notation for spatial velocities that bridges the gap between abstract Lie groups and their applications to robotics and computer graphics.","title":"Representing Spatial Velocities"},{"content":"Fiber Bundles A fiber bundle is a sequence of maps of the form\n$$ \\begin{array}{@{}c@{\\ }c@{\\,}c@{}} \\mathcal F \u0026 \\xhookrightarrow{\\quad\\iota\\quad} \u0026 \\mathcal E \\\\ \u0026 \u0026 \\bigg\\downarrow\\rlap{\\scriptstyle\\pi} \\\\[0.3ex] \u0026 \u0026 \\mathcal M \\end{array} $$where $\\mathcal F$ is the fiber, $\\mathcal E$ is the total space, and $\\mathcal M$ is the base space. 
As the notation suggests, $\\iota$ is an injective map referred to as the inclusion which \u0026ldquo;places\u0026rdquo; the fiber $\\mathcal F$ vertically inside $\\mathcal E$. Meanwhile, $\\pi$ is a surjective map known as the projection — given a point $q\\in\\mathcal E$, the point $\\pi(q)\\in\\mathcal M$ is the shadow cast by $q$ down on the base space.\nThe preimage under $\\pi$ of the point $p\\in\\mathcal M$ is diffeomorphic to $\\mathcal F$, and is called \u0026ldquo;the fiber above $p$\u0026rdquo;. The job of $\\iota$ is to tell us what \u0026ldquo;a typical fiber\u0026rdquo; in $\\mathcal E$ looks like. The typical fiber of the Hopf fibration is a circle:\nThe Hopf Fibration is the fiber bundle, $S^1\\hookrightarrow S^3 \\rightarrow S^2$. Here, I visualize points on $S^2$ and the corresponding fibers sitting above in $S^3$. The total space $\\mathcal E$ locally looks like $\\mathcal M \\times \\mathcal F$, but globally it can look quite different. For instance, a Möbius strip can be viewed as a fiber bundle over the circle, $\\mathcal M=S^1$, whose fiber is the closed unit interval $\\mathcal F=[0,1]$. Infinitely many copies of this fiber are glued together to create a fiber bundle. However, the Möbius strip is quite different from $S^1\\times[0,1]$, which is a cylinder. The latter is called a trivial fiber bundle. A formal definition of a fiber bundle can be found on Wikipedia — it relates the manifold structures (i.e., open sets and charts) of $\\mathcal F$, $\\mathcal E$, and $\\mathcal M$. See Qfwfq's explanation of how a principal bundle can be constructed by gluing charts; Stephen Bruce Sontz's book (which I thoroughly enjoyed!) takes a similar approach. Also see Frederic Schuller. A principal $G$-bundle is one whose fiber $\\mathcal F = G$ is a Lie group, and where each fiber has a right $G$-action that is compatible with the fiber bundle structure in a certain way. 
The reason why we consider a right action is that we can create principal bundles out of homogeneous spaces , and in doing so, the right-action on the fiber must play nice with the left-action of the homogeneous space. Let\u0026rsquo;s assume you know a bit about homogeneous spaces, and begin our story there.\nCoset Spaces A homogeneous space refers to a smooth manifold $\\mathcal M$ equipped with a left $G$-action (satisfying some axioms). We let $g\\in G$ act on $p\\in\\mathcal M$ to yield $g\\cdot p\\in \\mathcal M$. A homogeneous space does not have a distinguished point (think of a sphere, on which no point is more special than the other). However, if we do choose a special point $p_0\\in\\mathcal M$ (sort of like choosing the origin of a coordinate system), we can view $\\mathcal M$ as a coset space . We do so by considering the subgroup $G_{p_0} \u003c G$ of group actions that leave the point $p_0$ fixed — the stabilizer subgroup of $p_0$ in $G$. Denoting the subgroup $G_{p_0}$ as $H$, we have the following picture:\nwhich is that of a principal $H$-bundle. The map $\\pi$ sends $g\\in G$ to the point $g \\cdot p_0$, and we observe that $\\pi(g)=\\pi(gh)$ for any $g\\in G$ and $h\\in H$. We define a section (think cross-section) of this bundle to be a map $\\sigma$ that is a right-inverse of $\\pi$, i.e., $\\pi \\circ \\sigma=\\mathrm{id}_{G/H}$ is the identity map on $G/H$.1\nExample 1 (Spheres). $SO(3)$ acts on the sphere $S^2$. Given a point $p\\in S^2$, the rotations about the axis passing through $-p$ and $p$ are the stabilizer subgroup $SO(3)_p$. Then, $SO(3)/SO(3)_p$ is isomorphic to $S^2$, and we get a fiber bundle $$ SO(2) \\hookrightarrow SO(3) \\rightarrow S^2. $$ Observe that $SO(3)_p\\cong SO(2)$. Due to this isomorphism, you will often see the claim $SO(3)/SO(2) \\cong S^2$. 
However, it is important to note that there is no unique map $SO(2) \\overset{\\iota}{\\rightarrow} SO(3)$ (any one-parameter subgroup of $SO(3)$ is isomorphic to $SO(2)$)!\nExample 2 (Hopf Fibration). The set of unit quaternions2 $\\mathbb Q_S$ can be identified with the matrix Lie group $SU(2)$. Quaternions act on points in $\\mathbb R^3$ via rotations, with the quaternions $q,-q$ representing the same rotation. Suppose $H\\leq SU(2)$ is all the rotations about a fixed axis, let\u0026rsquo;s say the rotations about the $\\mathrm{z}$-axis. The resulting bundle $H\\hookrightarrow SU(2) \\rightarrow S^2$ is what is visualized in the video above (also see the appendix)! Note that $H\\cong SO(2)\\cong S^1$.\nExample 3 (SPD Matrices). Let the set of $n\\times n$ symmetric positive-definite matrices be written as $\\mathrm{SPD}(n)$. Given a matrix $g\\in GL(n)$, it acts on $\\boldsymbol\\Sigma\\in\\mathrm{SPD}(n)$ as $\\boldsymbol\\Sigma \\mapsto \\mathbf g\\boldsymbol\\Sigma \\mathbf g^\\top$, where $g\\mapsto \\mathbf g$ is the standard representation of $GL(n)$ as $n\\times n$ matrices. To see the coset space structure of $\\mathrm{SPD}(n)$, we need to choose a distinguished point on it, and the fiber sitting above it will be the group of elements that leave the distinguished point fixed. I leave this as an exercise, with a hint in the footnotes!3\nConversely, given a Lie group $G$ and a subgroup $H\\leq G$, we can view $G/H$ as a homogeneous space. A point in $G/H$ is of the form $gH$. The fiber sitting above $gH$ may also be denoted as $gH$, but technically refers to the subset $$\\pi^{-1}[\\lbrace gH\\rbrace] = \\lbrace g h \\mathrel| h\\in H \\rbrace \\subseteq G,$$ a left coset of $H$ in $G$. The map $\\pi$ projects $g$ to its coset $gH\\in G/H$ (as explained in one of my earlier posts ), with the identity coset $eH$ viewed as being a distinguished point on $G/H$. Importantly, we also have a right $H$-action on the group. 
This right-action preserves fibers, since for every $gh_1\\in gH$ and $h_2\\in H$, we have $(gh_1)h_2 \\in gH.$ Thus, given a subgroup $H \u003c G$ we not only get a homogeneous space $G/H$, but we can also endow $G$ with the structure of a principal $H$-bundle. The homogeneous space structure lets us move \u0026ldquo;horizontally\u0026rdquo; in $G$ using the left $G$-action, while the principal bundle structure lets us move \u0026ldquo;vertically\u0026rdquo; using the right $H$-action. The following examples are the two extreme cases where we can only move horizontally or vertically. In Example 4, the fiber is trivial, and all of the content of the total space $G$ is in the base space — vice versa for Example 5.\nExample 4 ($G$-Torsor). This is the principal $\\lbrace e\\rbrace$-bundle where $G$ sits \u0026ldquo;horizontally\u0026rdquo;, defined by $e\\hookrightarrow G \\rightarrow G/\\lbrace e \\rbrace$. When $G=(\\mathbb R^n, +)$ (where the group operation is vector addition) and $e=\\mathbf 0$ (the zero vector of $\\mathbb R^n$), the principal $\\mathbb R^n$-torsor we constructed above is the $n$-dimensional affine space (without the scalar multiplication operation). In other words, this is $\\mathbb R^n$ with the origin \u0026ldquo;forgotten\u0026rdquo;.\nExample 5 (Principal $G$-Bundle over a Point). This is the fiber bundle where $G$ sits \u0026ldquo;vertically\u0026rdquo; over a single point, defined by $G\\hookrightarrow G \\rightarrow \\lbrace 🐈 \\rbrace$. We can replace 🐈 by any other object, doesn\u0026rsquo;t matter.\nPrincipal Bundles In the above, we constructed the fiber bundle $H\\hookrightarrow G \\rightarrow G/H$. We showed that we can then view $G/H$ as being a homogeneous space with a left $G$-action, and $G$ as a principal $H$-bundle with a right $H$-action that preserves fibers. Turns out that these objects carry too much information, and we need to discard some of this information to narrow our focus. 
For instance, $G/H$ is technically only a homogeneous space after we stop viewing the identity coset $eH$ as being a special point. As Example 4 shows, every Lie group can itself be viewed as a homogeneous space once we \u0026ldquo;forget\u0026rdquo; the identity element. Similarly, a general principal $H$-bundle may not come with a left $G$-action, as exemplified in the next section.\nExample 6 (Tangent Bundle). Let $\\mathcal M$ be a smooth $n$-dimensional manifold. The tangent space at a point $p\\in \\mathcal M$ is denoted as $T_p\\mathcal M$, and the union of these tangent spaces is written as $T\\mathcal M$. We can define a unique smooth structure on $T\\mathcal M$ that makes the following maps smooth:\n$${\\mathbb R}^n \\hookrightarrow T\\mathcal M \\rightarrow \\mathcal M. $$This fiber bundle has a fiber-preserving $GL(n)$ action — matrix-vector multiplication. However, $T\\mathcal M$ is not a principal $GL(n)$-bundle since its fibers are not isomorphic to $GL(n)$. Nevertheless, there is indeed a principal $GL(n)$-bundle closely related to $T\\mathcal M$, which is\u0026hellip;\nThe Frame Bundle Consider an ordered set of $n$ linearly independent vectors in $T_p\\mathcal M$. One such set is called a frame. The space of all the frames at $p$ makes up the fiber, and the union of these fibers leads to the notion of a frame bundle $F\\mathcal M$. More formally, we can view a frame at $p\\in\\mathcal M$ as being a linear isomorphism from $\\mathbb R^n$ to $T_p\\mathcal M$. The frame bundle is then (as a set) the following object: $$ F\\mathcal M = \\left\\lbrace (\\hspace{1pt}p, f)\\,\\mathrel|\\,p\\in\\mathcal M,\\ f:\\mathbb R^n \\rightarrow T_p\\mathcal M \\textup{ is an isomorphism}\\right\\rbrace $$The frame bundle has an action of $GL(n)$. 
We need to take care that this is a right-action and not a left-action, the difference being that the group element $gh \\in GL(n)$ should act on a vector $(p,v)\\in T\\mathcal M$ as $$(\\hspace{1pt}p,v)\\cdot gh= \\left((\\hspace{1pt}p,v)\\cdot g\\right)\\cdot\\, h$$ On the other hand, a left-action of $gh$ would look different — $h$ will act first, then $g$.\nDo we have a group action on $F\\mathcal M$ that moves between fibers, like we did on a coset space? That is, is there some map $\\Phi$ that takes $(p_1,f_1)$ to $(p_2,f_2)$ with $p_1 \\neq p_2$? Not necessarily. If such a map did exist, then $\\pi \\circ \\Phi$ would be a group action on $\\mathcal M$. This is where the linear-isomorphisms interpretation of the frame bundle comes in handy; given an $n\\times n$ matrix $\\mathbf A\\in GL(n)$, we let $$(\\hspace{1pt}p,f)\\mathrel\\cdot\\mathbf A = (\\hspace{1pt}p,f\\circ \\mathbf A),$$ where $[f\\circ \\mathbf A](\\mathbf v) = f(\\mathbf A \\mathbf v)$. There is a way to view $T\\mathcal M$ as a fiber bundle that is \u0026ldquo; associated \u0026rdquo; to the frame bundle. We say that $GL(n)$ is the structure group of $T\\mathcal M$, and we call $T\\mathcal M$ a $GL(n)$-bundle. An associated bundle has a left action of its structure group. If that is confusing as hell, then you\u0026rsquo;re on the right track.\nEhresmann Connections Think of the fibers as trees in a forest. Consider the plight of our monkey, Ehresmann . It is clear to him how to climb up and down a tree (using the right $H$-action), but Ehresmann wants to jump from one tree to another. He needs to decide what constitutes horizontal movement! Will he jump such that his height off the ground is the same before and after the jump? Will he jump so that the tree length $\\ell$ (as measured from the ground) is the same? Perhaps he will jump in a direction perpendicular to his tree?\nAssume Ehresmann the monkey is at the point $q \\in \\mathcal E$. 
The space of all the directions in which he could jump constitutes the tangent space, $T_{q}\\mathcal E$. The vertical subspace $\\mathrm{Ver}_q\\mathcal E$ is all the directions that keep Ehresmann on the same tree. It is precisely the vectors $v$ that satisfy $d\\pi_q(v)=0$; these directions do not change Ehresmann\u0026rsquo;s coordinates as measured along the ground. The vertical subspaces, collectively called the vertical subbundle $\\mathrm{Ver}\\mathcal E$, are well-defined as soon as we define a principal bundle.\nThe horizontal subbundle is something that must be chosen, similar to how we choose an inner product on a vector space. It turns out that there are a few different, equivalent ways of specifying a horizontal subbundle for a principal $H$-bundle:\nspecify a horizontal subspace $\\mathrm{Hor}_q\\mathcal E \\subseteq T_{q}\\mathcal E$ for each $q\\in\\mathcal E$ such that $$T_q\\mathcal E =\\mathrm{Hor}_q\\mathcal E\\oplus \\mathrm{Ver}_q\\mathcal E,$$ choose a linear projection $\\Gamma:T\\mathcal E \\rightarrow \\mathrm{Ver}\\mathcal E$ and define $\\mathrm{Hor}\\mathcal E \\coloneq \\mathrm{ker}(\\Gamma)$,4 write down a $T\\mathcal E$-valued one-form $\\Gamma$ on $\\mathcal E$ that satisfies $\\Gamma\\vert_{\\mathrm{Ver}\\mathcal E}=\\mathrm{id}_{\\mathrm{Ver}\\mathcal E}$ (i.e., $\\Gamma$ restricted to $\\mathrm{Ver}\\mathcal E$ is the identity map on $\\mathrm{Ver}\\mathcal E$), or write down an $\\mathfrak h$-valued one-form $\\Gamma$ on $\\mathcal E$. It takes some introspection and a few dozen cups of coffee to see why these are indeed equivalent objects (i.e., none of them specifies too much or too little structure!). An Ehresmann connection is a choice of horizontal subbundle that satisfies a few additional properties (cf. Sec. 
10.1), most notably the $H$-invariance condition: $$ (R_g)_*\\mathrm{Hor}_q\\mathcal E = \\mathrm{Hor}_{qg}\\mathcal E \\quad \\forall q\\in\\mathcal E, g\\in H, $$ where $R_g:\\mathcal E \\rightarrow \\mathcal E$ is the right-action of $g\\in H$ on $\\mathcal E$, and $(R_g)_*$ is the corresponding pushforward map on tangent spaces. This condition ensures that the right $H$-action on $\\mathcal E$ preserves all the horizontal subspaces. Note that we can write an analogous condition for $\\mathrm{Ver}\\mathcal E$, but since the vertical subbundle does not depend on a choice of connection, it is automatically satisfied!\nThe $H$-invariance condition can also be written in terms of the connection form: $$ R_g^* \\Gamma = \\mathrm{Ad}_{g^{-1}} \\circ \\Gamma \\quad \\forall g\\in H. $$ It's easy for me to say \"$\\mathfrak h$-valued one-form\" and leave the rest to your imagination. However, there is a crucial aspect of the exterior algebra that must be modified when moving from $\\mathbb R$-valued one forms to Lie-algebra valued one-forms. Can you figure out what it is before clicking on the link? The Maurer-Cartan Form When we used a coset space to create a principal $H$-bundle, we used the sequence of smooth maps $H\\hookrightarrow G \\rightarrow G/H.$ This gives us a corresponding sequence of linear maps: $$\\mathfrak h \\,\\hookrightarrow \\,\\mathfrak g \\,\\rightarrow \\,\\mathfrak g/\\mathfrak h,$$ which is a short exact sequence in the category of vector spaces. Here, $\\mathfrak g/\\mathfrak h$ does not necessarily have a canonically defined Lie bracket for the same reason that $G/H$ is not necessarily a Lie group. Furthermore, there is generally no canonical way of viewing the quotient vector space $\\mathfrak g/\\mathfrak h$ as a subspace of $\\mathfrak g$ — this is a choice that must be made. 
It is precisely the choice that is made by an Ehresmann connection.\nBut that is not the topic of this section\u0026hellip; in this section, we will consider the \u0026ldquo;completely vertical\u0026rdquo; bundle $G\\hookrightarrow G \\rightarrow \\lbrace 🐈 \\rbrace$ from Example 5. This is the principal $G$-bundle that has a single fiber sitting above 🐈. You might say that there is no choice of horizontal subspace to be made here, since $TG=VG$. But there is indeed a choice of horizontal subspace, it just happens to be the trivial choice of $HG=\\lbrace \\mathbf 0 \\rbrace$. Since there is exactly one Ehresmann connection on this bundle, and given the various equivalent ways of defining a connection, what is the corresponding $\\mathfrak g$-valued one-form?\nGiven a vector $X\\in\\mathfrak g$, we can drop a vector field $\\widetilde X$ on $G$ using its right-action on itself: $$\\widetilde X_g = \\frac{d}{dt}g\\exp(tX)\\Big|_{t=0}.$$ This happens to be the left-invariant vector field (LIVF) of $X$. The Ehresmann connection is then the map $\\omega:TG \\rightarrow \\mathfrak g$ that sends $\\widetilde X$ to $X$: $$ \\omega(\\widetilde X) = X, $$ and is called the Maurer-Cartan form of $G$. Observe that $\\omega_g=L_{g^{-1}*}$. It is an exercise for the reader to verify that the Maurer-Cartan form satisfies the $G$-invariance condition $$ R_h^* \\omega_g = \\mathrm{Ad}_{h^{-1}} \\omega_{gh^{-1}} $$ for all $g\\in G$.5 The uniqueness of the Maurer-Cartan form (as an Ehresmann connection on the Example 5 bundle) is shown in Theorem 10.3 of Sontz . 
At the 22 minute mark , Schuller shows a connection between $\\omega$ and local representations of connections.\nThe exterior derivative of an Ehresmann connection is related to the curvature of the horizontal subbundle; I characterize the exterior derivative of $\\omega$ in Appendix B.\nThe Christoffel Symbols What we call \u0026ldquo;a connection\u0026rdquo; in differential geometry can be viewed as an Ehresmann connection on the frame bundle. Given a coordinate chart $(U, x^i)$ on $\\mathcal M$, we get a local section $\\sigma$ of $F\\mathcal M$ over $U$, defined by $$ \\sigma(p) = \\left(\\hspace{1pt}p, \\frac{\\partial}{\\partial x^1}\\Big|_p, \\cdots, \\frac{\\partial}{\\partial x^n}\\Big|_p\\right). $$ Given a connection form $\\boldsymbol\\Gamma$ on $F\\mathcal M$, we can pull it back under $\\sigma$ to get $\\boldsymbol{\\overline\\Gamma}\\coloneq \\sigma^\\ast \\boldsymbol\\Gamma$. We can then express it in the coordinate frame as $\\boldsymbol{\\overline\\Gamma}=(\\boldsymbol{\\overline\\Gamma})_{i}dx^i$. But ${\\overline\\Gamma}$ takes values in $\\mathfrak {gl}(n)$, which means that we can write $\\boldsymbol{\\overline\\Gamma} = {{\\overline\\Gamma}}^m_{ni}dx^i$. The coefficients ${{\\overline\\Gamma}}^m_{ni}$ are precisely the Christoffel symbols of the connection in the coordinate chart $(U,x^i)$. This formulation of the Christoffel symbols shows why they are not tensorial quantities; the $m$ and $n$ indices correspond to the $\\mathfrak{gl}(n)$-valued nature of the connection form, while the $i$ index is tensorial.\nA choice of horizontal subbundle lets us parallel transport vectors along curves on $\\mathcal M$. 
Given a curve $\\gamma:[0,1]\\rightarrow \\mathcal M$ and a frame $e\\in F_{\\gamma(0)}\\mathcal M$, we can lift $\\gamma$ to a curve $\\widetilde \\gamma:[0,1]\\rightarrow F\\mathcal M$ such that $\\pi \\circ \\widetilde\\gamma = \\gamma$, $\\widetilde\\gamma(0)=e$, and $\\widetilde\\gamma'(t) \\in \\mathrm{Hor}_{\\widetilde\\gamma(t)}F\\mathcal M$ for all $t\\in[0,1]$. This is called the horizontal lift of $\\gamma$ starting at $e$. The frame at $\\widetilde\\gamma(1)$ is then the parallel transport of the frame at $\\widetilde\\gamma(0)$ along $\\gamma$. The illustration at the top of this post also explains holonomy: the observation that parallel transporting a vector along a loop may not return the vector to its original position.\nAppendix A. Visualizing the Hopf Fibration Here are some functions in Julia that I used to animate the Hopf fibration:\nusing Quaternions  # assumption: quat and imag_part come from Quaternions.jl\n\n# section σ:S²→S³\nfunction σ(p₁,p₂,p₃)\n  1/sqrt(2*(1+p₃)) * quat(1 + p₃, -p₂, p₁, 0)\nend\n\n# stabilizer subgroup of (0,0,1)\nH = [quat(cos(angle), 0, 0, sin(angle)) for angle in range(0, 2π, length=1000)]\n\n# preimage of {p} under π:S³→S²\nfunction π⁻¹(p)\n  p₁,p₂,p₃ = p\n  f_quat = σ(p₁,p₂,p₃) .* H\n  map(f_quat) do q\n    r = real(q)\n    # send q into the unit ball of ℝ³ (radius acos(r)/π along its imaginary axis), then offset and scale\n    (acos(r)/(π * sqrt(1-r^2)) .* imag_part(q) .+ (0, 0, 1.6)) .* 1.5\n  end\nend\n\n# parametrization of the sphere Ψ:ℝ²→S²\nfunction sphere(ϕ,θ)\n  (sin(ϕ)*cos(θ), sin(ϕ)*sin(θ), cos(ϕ))\nend\nAppendix B. Maurer-Cartan Structure Equation Given a basis $(E_i)_{i=1}^n$ for $\\mathfrak g$, the Maurer-Cartan form can be expressed as $$\\omega\\coloneqq E_i \\varepsilon^i,$$which eats a vector field and returns a function from $G$ to $\\mathfrak g$, i.e., $\\omega:\\mathfrak X(TG)\\rightarrow C^\\infty(G,\\mathfrak g)$. Some of its properties are:\nIt is left-invariant since its components in the left-invariant coframe are constants (rather than functions). If it eats a LIVF, then the resulting function is a constant: $\\omega(\\widetilde X)=X^i E_i=X$. 
An arbitrary vector field can be written as $f^i\\widetilde E_i$, where $f^i:G\\rightarrow \\mathbb R$ need not be constant. We have, $\\omega(f^i\\widetilde E_i)=f^i\\omega(\\widetilde E_i)=f^i E_i.$ Basically, $\\omega$ trivializes the tangent bundle $TG$ into $G\\times \\mathfrak g$. The exterior derivative of $\\omega$ is defined via the formula in Prop. 14.29 of Lee ISM : $$d\\omega(\\widetilde X,\\widetilde Y)=E_i d\\varepsilon^i(\\widetilde X,\\widetilde Y)=E_i\\Big(\\widetilde X\\big(\\varepsilon^i(\\widetilde Y)\\big)- \\widetilde Y\\big(\\varepsilon^i(\\widetilde X)\\big)-\\varepsilon^i([\\widetilde X,\\widetilde Y])\\Big).$$ Because $\\varepsilon^i(\\widetilde X)$ and $\\varepsilon^i(\\widetilde Y)$ are constant, we have $$ \\begin{align} d\\omega(\\widetilde X,\\widetilde Y)\u0026=-E_i\\varepsilon^i([\\widetilde X,\\widetilde Y])\\\\ \u0026=-X^j Y^k c_{jk}^i E_i\\\\ \u0026=-[\\omega(\\widetilde X), \\omega(\\widetilde Y)], \\end{align} $$ where $c_{jk}^i$ are the structure constants of $\\mathfrak g$ in the basis $(E_i)_{i=1}^n$. This equation is called the Maurer-Cartan structure equation. For any two $\\mathfrak g$-valued forms, we can define $$ \\begin{align} \\big(\\,\\omega\\wedge \\eta\\,\\big)(\\widetilde X,\\widetilde Y)\u0026\\coloneqq[\\omega(\\widetilde X),\\eta(\\widetilde Y)]-[\\omega(\\widetilde Y),\\eta(\\widetilde X)]\\\\ \u0026=[\\omega(\\widetilde X),\\eta(\\widetilde Y)]+[\\eta(\\widetilde X),\\omega(\\widetilde Y)]. \\end{align} $$ Setting $\\eta=\\omega$, we get $$ \\begin{align} \\big(\\,\\omega\\wedge \\omega\\,\\big)(\\widetilde X,\\widetilde Y)=2[\\omega(\\widetilde X),\\omega(\\widetilde Y)]. \\end{align} $$Hence, we can write $d\\omega=-\\frac{1}{2}\\omega\\wedge \\omega=-\\frac{1}{2}E_ic^i_{jk}\\varepsilon^j\\wedge \\varepsilon^k$.6\nSuppose we choose a matrix representation $\\boldsymbol\\Phi:G\\rightarrow GL(m,\\mathbb R)$. Then, $d\\boldsymbol\\Phi_e$ is a representation of $\\mathfrak g$. 
Let $\\boldsymbol\\omega\\coloneqq d\\boldsymbol\\Phi_e\\circ\\omega$ be a $\\mathfrak{gl}(m)$-valued one-form on $G$. We have, $$ \\newcommand{\\mf}[1]{\\mathbf{#1}} \\begin{align} \\boldsymbol\\omega \u0026=\\big(d\\boldsymbol\\Phi_eE_i\\big)\\varepsilon^i=\\mf E_i \\varepsilon^i,\\\\ d\\boldsymbol\\omega(\\widetilde X,\\widetilde Y) \u0026=-[\\boldsymbol\\omega(\\widetilde X), \\boldsymbol\\omega(\\widetilde Y)]=-\\big(\\boldsymbol\\omega(\\widetilde X) \\boldsymbol\\omega(\\widetilde Y)-\\boldsymbol\\omega(\\widetilde Y) \\boldsymbol\\omega(\\widetilde X)\\big). \\end{align} $$Finally, we will choose a parametrization $\\psi:\\mathbb R^n\\rightarrow G$ and pull back $\\boldsymbol\\omega$ and $d\\boldsymbol\\omega$ under $\\psi$. First, we do this for $\\boldsymbol\\omega$: $$ \\begin{align} \\psi^\\ast\\boldsymbol\\omega \u0026=\\mf E_i\\big(\\psi^\\ast\\varepsilon^i\\big) \\end{align} $$Let $\\psi^\\ast\\varepsilon^i=\\epsilon^i_{k}dx^k$. Then $$\\big(\\psi^\\ast \\varepsilon^i\\big)\\left(\\frac{\\partial}{\\partial x^k}\\right)=\\epsilon^i_k=\\varepsilon^i\\left(\\psi_\\ast \\frac{\\partial}{\\partial x^k}\\right)=\\left((\\psi)^{-1}\\psi_\\ast \\frac{\\partial}{\\partial x^k}\\right)^{\\vee^i}=\\mf J^i_k$$ in the notation of my previous post . Hence, $$ \\begin{align} \\psi^\\ast\\boldsymbol\\omega \u0026= (\\boldsymbol\\psi)^{-1}\\frac{\\partial\\,\\boldsymbol\\psi}{\\partial x^k}dx^k, \\end{align} $$which is a $d\\boldsymbol\\Phi_e(\\mathfrak{g})$-valued one-form on the parameter space $\\mathbb R^n$, where $\\boldsymbol\\psi\\coloneqq\\boldsymbol\\Phi \\circ \\psi$.\nLet $\\boldsymbol\\omega^\\vee \\coloneqq (\\,\\cdot\\,)^\\vee\\circ d\\boldsymbol\\Phi_e\\circ\\omega$. 
Then, we have $\\psi^\\ast \\boldsymbol\\omega^\\vee(\\mf y)=J(\\mf x)\\mf y$ for all $\\mf y\\in T_{\\mf x}\\mathbb R^n$, or in terms of components, $\\psi^\\ast \\boldsymbol\\omega^\\vee=\\mf J_jdx^j$, where $\\mf J_j$ is the $j^{th}$ column of $\\mf J$.\nPullbacks commute with the exterior derivative and linear transformations, which simplifies the pullback of $d\\boldsymbol\\omega$:\n$$ \\begin{align} \\psi^\\ast d\\boldsymbol\\omega^\\vee \u0026=d \\left(\\psi^\\ast \\boldsymbol\\omega^\\vee\\right)\\\\ \u0026 = \\frac{\\partial\\mf J_j }{\\partial x^i}dx^i\\wedge dx^j\\\\ \u0026=\\frac{\\partial}{\\partial x^i}\\left((\\boldsymbol\\psi)^{-1}\\frac{\\partial\\,\\boldsymbol\\psi}{\\partial x^j}\\right)dx^i\\wedge dx^j\\\\ \u0026=\\left(-(\\boldsymbol\\psi)^{-1}\\frac{\\partial\\,\\boldsymbol\\psi}{\\partial x^i}(\\boldsymbol\\psi)^{-1}\\frac{\\partial\\,\\boldsymbol\\psi}{\\partial x^j}+(\\boldsymbol\\psi)^{-1}\\frac{\\partial^2\\,\\boldsymbol\\psi}{\\partial x^i\\partial x^j}\\right)dx^i\\wedge dx^j \\end{align}$$ On the other hand, we can also move $d$ \u0026lsquo;inwards\u0026rsquo; past the linear operators: $$ \\begin{align} \\psi^\\ast d\\boldsymbol\\omega^\\vee \u0026= \\psi^\\ast d\\left((\\,\\cdot\\,)^\\vee\\circ d\\boldsymbol\\Phi_e\\circ\\omega\\right)\\\\ \u0026=\\psi^\\ast\\big(d\\boldsymbol\\Phi_e(d\\omega)\\big)^\\vee \\\\ \u0026= -\\frac{1}{2}\\mf e_ic^i_{jk}\\psi^\\ast(\\varepsilon^j\\wedge \\varepsilon^k)\\\\ % \u0026= -\\frac{1}{2}\\mf e_ic^i_{jk}(\\psi^\\ast\\varepsilon^j\\wedge \\psi^\\ast\\varepsilon^k)\\\\ \u0026=-\\frac{1}{2}\\mf e_ic^i_{jk}(\\mf J^j_\\ell dx^\\ell\\wedge \\mf J^k_m dx^m). % \\\\ % \u0026=-\\frac{1}{2}\\mf e_ic^i_{jk}\\mf J^j_\\ell\\mf J^k_m\\, dx^\\ell\\wedge dx^m \\end{align} $$Hence, $$ \\begin{align} \\frac{\\partial\\mf J_i}{\\partial x^j} - \\frac{\\partial\\mf J_j}{\\partial x^i}\u0026=\\,\\mf e_\\ell c^\\ell_{mk}\\mf J^k_j\\mf J^m_i. \\end{align} $$ Theorem 21.17 of John M. 
Lee\u0026rsquo;s Smooth Manifolds describes the sense in which $G/H$ is a smooth manifold — there is a unique smooth structure on it that makes $\\pi$ a smooth map.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nI denote $\\mathbb Q_S$ as such partly because it reads \u0026ldquo;quaternion sphere\u0026rdquo;, and partly because the \u0026lsquo;$S$\u0026rsquo; in the Special Linear group $SL(n)$ corresponds to an analogous constraint — that of restriction to the determinant-one matrices.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nChoose the distinguished point as the $n\\times n$ identity matrix. What is the set of matrices that leaves this matrix unchanged under the group action, $\\boldsymbol\\Sigma \\mapsto \\mathbf g\\boldsymbol\\Sigma \\mathbf g^\\top$? This is a well-known subgroup of $GL(n)$! With this choice of the distinguished point, $\\pi$ is given by $\\mathbf g \\mapsto \\mathbf g \\mathbf g^\\top$.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nIt is a subtle point that, even though the vertical subbundle $\\mathrm{Ver}\\mathcal E$ is well-defined, the projection of an arbitrary vector $v\\in T_q\\mathcal E$ onto $\\mathrm{Ver}_q\\mathcal E$ depends on the choice of horizontal subbundle/connection.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThe action of the left-hand side on an arbitrary vector $v\\in T_{gh^{-1}} G$ is $$ R_h^* L_{g^{-1}*}(v) = L_{g^{-1}*} R_{h*}(v), $$ while the action of the right-hand side is $$ \\mathrm{Ad}_{h^{-1}} L_{(gh^{-1})^{-1} *}(v) = L_{h^{-1}*} R_{h*} L_{h*}L_{g^{-1}*}(v), $$ and the fact that the $L$ and $R$ maps commute gives us the desired equality.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThe usual definition of a wedge product does not work for $\\mathfrak g$-valued one-forms. In fact, $\\alpha \\wedge \\alpha=0$ for the ordinary wedge product, which uses scalar multiplication! 
For $\\mathfrak g$-valued one-forms, we replace scalar multiplication with the only other bilinear operation available to us, the Lie bracket.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/bundles/","summary":"A principal bundle is a fiber bundle, that additionally has the right-action of a group that preserves fibers. Beginning with coset spaces, we look at some interesting examples of principal bundles and the things we can do with them.","title":"Principal Bundles"},{"content":"I collect here some results about the differentials of the $\\exp$ and $\\log$ maps of a Lie group. The reader must be familiar with how the differential (i.e., pushforward map) is defined on a general manifold for much of this to make any sense. I will also rely extensively on the interpretation of tangent vectors as equivalence classes of curves ; Wikipedia has an excellent summary of this, as does my blog (if I may say so myself).\nLet $L_g$ refer to the left-multiplication operation: $L_g(h)=gh$ for $g,h\\in G$. Given $X\\in\\mathfrak g$, $X^L_g\\coloneq dL_g(X)$ is the corresponding left-invariant vector field. For a matrix Lie group (i.e., where $g$ is identified with some finite-dimensional matrix representation of it), $X^L_g=g X$, which makes sense because we have an external algebraic structure (namely, the multiplication of $n\\times n$ matrices, which has the structure of a unital associative algebra ) that lets us multiply group elements with Lie algebra elements. This rather convenient structure does not exist for a general Lie group. I will assume that the maps $\\exp: U \\rightarrow V$ and $\\log:V\\rightarrow U$ are inverses of each other, where $U\\subseteq \\mathfrak g$ is a neighborhood of $0$ and $V\\subseteq G$ is a neighborhood of $e$.\n🀲 Differential of $\\exp$ Suppose $\\exp(X)$ is a point on $G$. 
The differential of $\\exp$ provides an answer to the following question: what happens when $X$ is perturbed in the direction $Y \\in T_{X}\\mathfrak g$? The result, referred to as the pushforward of $Y$ under $\\exp$, lies in $T_{\\exp(X)} G$. The differential itself is a linear map between vector spaces:\n\\[ d\\exp_X: T_{X}\\mathfrak g \\rightarrow T_{\\exp X} G. \\] Since $\\mathfrak g$ is a vector space, we can identify $T_X\\mathfrak g$ with $\\mathfrak g$. Using the map $dL_{{\\exp(X)}^{-1}}=dL_{{\\exp(X)}}^{-1}$, we can identify $T_{\\exp X} G$ with $T_e G$, which is also just $\\mathfrak g$. A pedagogical nightmare ensues.\nLet $\\gamma(t)$ be a curve on $\\mathfrak g$. We use the notation $[\\gamma]$ to refer to the equivalence class of all the curves1 that pass through $\\gamma(0)$ with the velocity $\\dot \\gamma(0)$.2 We say that $\\gamma$ is a representative of the class $[\\gamma]$. If $\\gamma$ is chosen such that $\\gamma(0)=X\\in\\mathfrak g$ and $\\dot \\gamma(0)=Y \\in T_X\\mathfrak g$, then $[\\gamma]$ is completely characterized by $Y$ and vice versa; therefore, we can write $[\\gamma]=Y$. The image of $\\gamma$ under $\\exp$, which is $\\exp\\circ \\gamma(t)$, is a curve on $G$ as well as a representative of $d\\exp_X(Y)$. That is, $$d\\exp_X(Y)=[\\hspace{2pt}\\exp\\circ \\hspace{1pt}\\gamma\\hspace{2pt}]$$is some vector in $T_{\\exp(X)} G$. But which vector?\nThe key to making these identifications concrete is to notice that we can choose any $\\gamma$ satisfying $\\gamma(0)=X$ and $\\dot {\\gamma}(0)=Y$. Any such $\\gamma$ is a representative of $Y$. So, we might as well choose a $\\gamma$ that is convenient to write down; let\u0026rsquo;s make the choice, $\\gamma(t) \\coloneq X+tY$. In this case, $\\exp \\circ\\hspace{2pt} \\gamma(t)=\\exp(X+tY)$ is a curve on $G$ passing through $\\exp(X)$, and it is a representative of $d\\exp_X(Y)$. 
There are two different ways of writing down the equivalence class of $d\\exp_X(Y)$:\n$$ d\\exp_X(Y) =\\big[\\exp(X+tY)\\big]=\\big[\\exp(X)\\exp(t\\hspace{2pt}\\square)\\big] $$where the box is a Lie algebra element: $\\square \\in \\mathfrak g$. There must be a value of $\\square$ for which these are indeed the same equivalence class. The value of $\\square$ can therefore be used to describe the quantity $d\\exp_X(Y)$. Hence, we define the Jacobian of $\\exp$ at $X$ as the linear map\n\\[ \\begin{align*} \\Psi_X:\\mathfrak g \u0026\\rightarrow \\mathfrak g\\\\ Y \u0026\\mapsto \\square \\end{align*} \\] In conclusion, the differential of $\\exp$ takes two objects, the point $X$ and the direction $Y$, and defines the object $\\square$. Here, $\\square$ represents the direction and extent to which $\\exp(X)$ is perturbed when we nudge $X$ in the direction of $Y$. On matrix Lie groups, $\\Psi_X$ is defined by\n\\[ \\Psi_X(Y) =\\square =\\exp(X)^{-1}\\frac{d}{dt}\\exp(X+tY)\\Big|_{t=0}, \\] and we write $d\\exp_X(Y)=\\exp(X)\\Psi_X(Y)$.\n🀳 Differential of $\\log$ By the same arguments as above,\n$$ d\\log_g: T_g G \\rightarrow T_{\\log(g)}\\mathfrak g. $$Since $T_g G\\cong \\mathfrak g$ and $T_{\\log(g)}\\mathfrak g\\cong \\mathfrak g$, we should once again expect a formula that goes from $\\mathfrak g$ to $\\mathfrak g$. It proves to be convenient to work entirely with $\\mathfrak g$-elements and write $\\exp(X)$ in place of $g$.\nLet $\\gamma(t) \\coloneq \\exp(X) \\exp(tY)$ be a curve passing through $\\exp(X)$ with the velocity $Y$. We have,\n\\[ \\begin{align*} d\\log_{\\exp(X)}(Y)\u0026=\\big[\\hspace{1pt}\\log\\mathrel\\circ\\gamma\\hspace{1pt}(t)\\hspace{1pt}\\big]\\\\\u0026=\\big[\\hspace{1pt}\\log\\big(\\exp(X)\\exp(tY)\\big)\\hspace{1pt}\\big]. 
\\end{align*} \\] As before, we have an alternative way of writing down the equivalence class:\n$$ \\big[\\hspace{1pt}\\log\\big(\\exp(X)\\exp(tY)\\big)\\hspace{1pt}\\big]= \\big[\\hspace{1pt}X + t\\hspace{1pt}\\square\\hspace{1pt}\\big]. $$Thus, the differential of $\\log$ takes two objects, $X$ (alternatively, $g\\coloneq \\exp(X)$) and $Y$, and its output is characterized by $\\square$. Here, $\\square$ is the direction in which $\\log(g)$ is perturbed when we perturb its argument $g$ in the direction $Y$. On matrix Lie groups, we can once again define the operator $\\Psi^{-1}_{\\exp(X)}:\\mathfrak g \\rightarrow \\mathfrak g$ using derivatives of matrix-valued functions:\n\\[ \\Psi^{-1}_{\\exp(X)}(Y) = \\frac{d}{dt}\\log\\big(\\exp(X)\\exp(tY)\\big)\\Big|_{t=0}. \\] We write $d\\log_{g}(gY)=\\Psi^{-1}_{g}(Y)$. Alternatively, we can write $d\\log_{g}(V)=\\Psi^{-1}_{g}(g^{-1}V)$, where $V\\in T_{g} G$.\nSection 3.2 of Jean Gallier's book has the formula for $d\\exp_X$, and the appendix of my paper \"Parameter Estimation on Homogeneous Spaces\" has $d\\log_{\\exp(X)}$ (which I denote there as $\\Psi$... oh well). We should verify that $\\Psi_{\\exp(X)}^{-1}$ is indeed the inverse of $\\Psi_X$. Can we show this algebraically?\n\\[ \\begin{align*} \\small \u0026\\Psi_X\\left(\\Psi_{\\exp(X)}^{-1}(Y)\\right)\\\\ \u0026\\quad = \\exp(X)^{-1}\\frac{d}{ds}\\exp\\left(X+s\\frac{d}{dt}\\log\\big(\\exp(X)\\exp(tY)\\big)\\Big|_{t=0}\\right)\\Big|_{s=0}\\\\ \\small \u0026\\quad \\overset{?}=Y,\\\\ \\small \u0026\\Psi_{\\exp(X)}^{-1}\\left(\\Psi_X(Y)\\right) \\\\\u0026\\quad = \\frac{d}{dt}\\log\\Big(\\exp(X)\\exp\\Big(t \\exp(X)^{-1}\\frac{d}{ds}\\exp\\left(X+sY\\right)\\Big|_{s=0} \\Big)\\Big)\\Big|_{t=0}\\\\ \\small \u0026\\quad \\overset{?}=Y. \\end{align*} \\] I\u0026rsquo;m not sure how to proceed\u0026hellip; In any case, such formidable calculations should not be needed to demonstrate a simple fact. 
The intrinsic versions of these results are immediate from the chain rule:\n\\[ d\\log_{\\exp(X)}\\left(d\\exp_X(Y)\\right) = d(\\log \\circ \\exp)_X(Y)=Y. \\] \\[ d\\exp_{\\log(g)}\\left(d\\log_g(V)\\right) = d(\\exp \\circ \\log)_g(V)=V. \\] 🀴 Parametrization Differentials also show up in Lie theory when we work with parametrizations of $G$. While most of this discussion is a specialization of what happens on a general manifold, the Lie group case is special because we are especially interested in left-invariant quantities (e.g., left-invariant vector fields and volume forms).\nLet $\\varphi:C\\rightarrow D$ be a parametrization of $D\\subseteq G$ whose domain is $C\\subseteq \\mathbb R^n$, such that its inverse $\\varphi^{-1}:D \\rightarrow C$ exists (and therefore, is a smooth coordinate chart). A function $f:G\\rightarrow \\mathbb R$ can be restricted to $D$ to define $f|_D:D\\rightarrow \\mathbb R$; nevertheless, we will just write $f$ to refer to the restricted function. We can pull $f$ back under $\\varphi$ to define $\\bar f\\coloneq {\\varphi}^\\ast f=f\\circ \\varphi$, so that $\\bar f(\\mathbf x)=f(\\varphi(\\mathbf x))$. We can differentiate $\\bar f$ using the usual rules of multivariable calculus. Let $\\frac{\\partial}{\\partial x^i}$ be the $i^{th}$ standard vector field in $C$. As vector fields do, it will map $\\bar f$ to another function:\n\\[ \\frac{\\partial}{\\partial x^i}\\big(\\,\\bar f\\,\\big):C \\rightarrow \\mathbb R. \\] As far as differential calculus on manifolds goes, the story pretty much ends here. However, on a Lie group we are interested in computing derivatives along the left-invariant vector fields (LIVFs). 
Letting $\\lbrace E_i \\rbrace_{i=1}^n$ be a basis for $\\mathfrak g$, we can express its LIVFs using the parametrization:\n\\[ E_{i,g}^L = g E_i = {\\overline E_i^j}(g)\\frac{\\partial}{\\partial x^j}\\Big|_{g} \\] Here, $\\frac{\\partial}{\\partial x^i}\\big|_{\\varphi(\\mathbf x)}\\coloneq d \\varphi_{\\mathbf x}\\frac{\\partial}{\\partial x^i}\\big|_{\\mathbf x}$ (by abuse of notation) is called the $i^{th}$ coordinate vector field, and $\\overline E_i^j$ are smooth functions. But what are these functions? We need to set up a commutative diagram that shows us how the action of $E_{i}^L$ on $f$ can be equated to the action of an $\\mathbb R^n$-vector field on $\\bar f$.\nLet $Z = \\mathbf z^i E_i$, where $\\mathbf z^i$ are real numbers (not functions!). Consider\n\\[ \\begin{align*} Z^L_{\\varphi(\\mathbf x)} f \u0026= \\mathbf z^i E^L_{i,\\varphi(\\mathbf x)}f =\\mathbf z^i\\frac{d}{dt}f\\left(\\varphi(\\mathbf x)\\exp(tE_i)\\right)\\Big|_{t=0}\\\\ \u0026= \\mathbf z^i{\\overline E_i^j}(\\varphi(\\mathbf x))\\frac{\\partial}{\\partial x^j}\\Big|_{\\varphi(\\mathbf x)}f\\\\ \u0026= \\mathbf z^i{\\overline E_i^j}(\\varphi(\\mathbf x))\\frac{\\partial \\bar f}{\\partial x^j}(\\mathbf x) \\end{align*} \\] where the last equality follows from the definition of the pushforward. Let $f$ be the $k^{th}$ coordinate function, ${x}^k:D \\rightarrow \\mathbb R$ (that is, $x^k\\coloneq {\\varphi^{-1}}^k$). Then, ${{\\partial x}^k}/{\\partial x^j}=\\delta^k_j$ (with the usual identifications), and we get\n\\[ \\begin{align*} E^L_{i,\\varphi(\\mathbf x)}x^k \u0026= {\\overline E_i^k}(\\varphi(\\mathbf x)). \\end{align*} \\] Hence,\n\\[ \\begin{align*} [Z^Lf](g) \u0026= \\mathbf z^i [E_i^Lx^j](g)\\frac{\\partial \\bar f}{\\partial x^j}(\\varphi^{-1}(g)). 
\\end{align*} \\] We can write this in the matrix-vector form as\n\\[ \\begin{align*} \\ \\nabla f_{g}^\\top\\mathbf z = \\nabla {\\bar f}^\\top_{\\varphi^{-1}(g)} M(g)\\,\\mathbf z\\ \\end{align*} \\] where\n\\[ \\begin{align*} \\nabla \\bar f \u0026= \\begin{bmatrix}\\frac{\\partial \\bar f}{\\partial x^1} \\\\ \\vdots \\\\ \\frac{\\partial \\bar f}{\\partial x^n}\\end{bmatrix}\\quad\\text{and}\\quad \\nabla f \u0026= \\begin{bmatrix}E_1^Lf \\\\ \\vdots \\\\ E_n^Lf\\end{bmatrix}, \\end{align*} \\] so that $\\nabla f_{g}^\\top \\mathbf z=[Z^L f](g)$, and $M^i_j=E_j^Lx^i$. What is the inverse of $M$? Let $J(\\mathbf x) \\coloneq \\left[M(\\varphi{\\small(\\mathbf x)})\\right]^{-1}$. It should satisfy3 \\[ \\begin{align*} M_j^i\\, J^j_k\\, \u0026= J^j_k [E^L_{j}x^i]\\\\\u0026= [{(J^j_k E_j)}^Lx^i]\\\\ \u0026= \\delta^i_k. \\end{align*} \\] So, $J^j_k$ should be such that\n\\[ (J^j_k(\\mathbf x) E_j)^L_{\\varphi(\\mathbf x)} = \\frac{\\partial}{\\partial x^k}\\Big|_{\\varphi(\\mathbf x)}=d\\varphi_{\\mathbf x}\\left(\\frac{\\partial}{\\partial x^k}\\Big|_{\\mathbf x}\\right). \\] In the matrix-vector form, we can write\n\\[ \\ \\nabla {\\bar f}_{\\mathbf x}^\\top\\mathbf y = \\nabla f_{\\varphi(\\mathbf x)}^\\top J(\\mathbf x)\\mathbf y\\ \\] Thus, the question of whether one uses $M$ or $J$ in calculations depends on whether the direction of differentiation is specified in the Lie algebra basis, or in the standard basis of $C\\subseteq \\mathbb R^n$ (i.e., it depends on whether it is $\\mathbf z$ or $\\mathbf y$ that is specified).\nWe can recover an explicit formula for $J$ that is used in the work of Chirikjian.4 On a matrix Lie group, we can view $\\varphi:\\mathbb R^n\\rightarrow \\mathbb R^{m \\times m}$ as a map between vector spaces. 
Letting $(\\hspace{1pt}\\cdot\\hspace{1pt})^\\vee:\\mathfrak g\\rightarrow \\mathbb R^n$ be the map $Z \\mapsto\\mathbf z$, which Chirikjian calls the vee map, we have\n\\[ J(\\mathbf x) = \\begin{bmatrix}\\big(\\varphi (\\mathbf x)^{-1}\\frac{\\partial \\varphi}{\\partial x^1}\\big)^\\vee \u0026 \\big(\\varphi (\\mathbf x)^{-1}\\frac{\\partial \\varphi}{\\partial x^2}\\big)^\\vee \u0026\\cdots \u0026 \\big(\\varphi (\\mathbf x)^{-1}\\frac{\\partial \\varphi}{\\partial x^n}\\big)^\\vee \\end{bmatrix}. \\] and\n\\[ M(g) = \\begin{bmatrix}E_1^L \\varphi^{-1}(g) \u0026 E_2^L \\varphi^{-1}(g) \u0026 \\cdots \u0026 E_n^L \\varphi^{-1}(g) \\end{bmatrix}= J(\\varphi^{-1}(g))^{-1}. \\] The map $\\varphi:\\mathbf x\\mapsto\\exp(\\mathbf x^iE_i)$ is called the exponential parametrization, whose inverse is $\\varphi^{-1}(g)=\\log(g)^\\vee$. In this case, $$ \\begin{align*} J(\\mathbf x)\\mathbf y\u0026=\\left(\\varphi(\\mathbf x)^{-1}\\left(\\mathbf y^i\\frac{\\partial \\varphi}{\\partial x^i}\\Big|_{\\mathbf x}\\right)\\right)^\\vee \\\\ \u0026= \\left(\\varphi(\\mathbf x)^{-1}\\frac{d}{d t}\\exp\\big((\\mathbf x^i + t \\mathbf y^i) E_i\\big)\\Big|_{t=0}\\right)^\\vee \\\\ \u0026= \\left(\\varphi(\\mathbf x)^{-1}\\frac{d}{d t}\\exp(X + tY)\\Big|_{t=0}\\right)^\\vee =\\big(\\Psi_X(Y)\\big)^\\vee, \\end{align*} $$ where $X^\\vee =\\mathbf x$ and $Y^\\vee =\\mathbf y$. Hence, $J(\\mathbf x)$ is nothing but the linear operator $\\Psi_X$ expressed in a specific basis. Similarly, $$ \\begin{align*} M(g)\\mathbf z\u0026= \\big[Z^L\\log^\\vee\\big](g) \\\\ \u0026= \\left(\\frac{d}{d t}\\log\\big(g\\exp(tZ)\\big)\\Big|_{t=0}\\right)^\\vee \\\\ \u0026=\\big(\\Psi^{-1}_g(Z)\\big)^\\vee. \\end{align*} $$ To put this in yet another way, $J(\\mathbf x) \\mathbf y$ expresses the vector $d\\varphi_{\\mathbf x}(\\mathbf y)$ in the basis of LIVFs, $\\lbrace E_{i,g}^L\\rbrace_{i=1}^n$ at $g=\\varphi (\\mathbf x)$. 
Meanwhile, $M(g) \\mathbf z$ expresses the LIVF $Z^L_g$ in the standard coordinate basis at $\\varphi^{-1}(g)$.\n🀵 Lengths and Volume In this final chapter, let\u0026rsquo;s look at how the Jacobians of a parametrization relate to the measurement of lengths and volumes on $G$. Assume that we have an inner product for $\\mathfrak g$, defined via\n\\[ \\langle{E_i},{E_j}\\rangle= W_{ij}. \\] This inner product defines a unique left-invariant Riemannian metric on $G$ (as explained in my previous posts), written as $W_{ij}\\varepsilon^i \\varepsilon^j$, where $\\lbrace \\varepsilon_{i}\\rbrace_{i=1}^n$ is the coframe that is dual to the frame of LIVFs, $\\lbrace E_{i}^L\\rbrace_{i=1}^n$. We now move to a point $\\mathbf x$ in the parameter space. There is presently no inner product for $T_{\\mathbf x}C$, since we haven\u0026rsquo;t chosen one. The key observation is that the Riemannian metric and volume form are both covariant tensor fields, so they can be pulled back (under $\\varphi$) from $D$ to $C$. For instance, if $\\langle \\cdot , \\cdot \\rangle_{\\mathbf x}$ is the pullback of the left-invariant Riemannian metric, then\n\\[ \\begin{align*} \\left\\langle \\frac{\\partial}{\\partial x^r},\\frac{\\partial}{\\partial x^s} \\right\\rangle_{\\mathbf x} \u0026= \\varphi^\\ast_{\\mathbf x}[W_{ij}\\varepsilon^i \\varepsilon^j]\\left(\\frac{\\partial}{\\partial x^r},\\frac{\\partial}{\\partial x^s}\\right)\\\\ \u0026= W_{ij}\\varepsilon^i \\varepsilon^j\\left(d\\varphi_{\\mathbf x}\\frac{\\partial}{\\partial x^r},d\\varphi_{\\mathbf x}\\frac{\\partial}{\\partial x^s}\\right)\\\\ \u0026= W_{ij}J(\\mathbf x)^i_rJ(\\mathbf x)^j_s. \\end{align*} \\] Hence, we have a weighted inner product at $\\mathbf x$: $$\\langle \\mathbf y,\\mathbf z \\rangle_{\\mathbf x}=\\mathbf y^\\top J(\\mathbf x)^\\top \\mathbf W J(\\mathbf x)\\mathbf z,$$and as $\\mathbf x$ varies, this defines a Riemannian metric on $C$. 
Unlike the left-invariant Riemannian metric on $D\\subseteq G$, the coefficients (i.e., the metric tensor) of the pullback Riemannian metric on $C\\subseteq\\mathbb R^n$ are not constant.\nThe choice of basis for $\\mathfrak g$ defines a unique density on $G$, written as $$\\omega=|\\varepsilon^1\\wedge\\varepsilon^2\\wedge\\cdots\\wedge \\varepsilon^n|.$$ This is an example of a left Haar measure for $G$, which is unique up to scaling. The absolute value is taken to make this a density rather than a volume form, saving us the trouble of worrying about orientation.\nBy a similar procedure, we can compute the pullback volume form on $C$. Let $\\varphi^\\ast\\omega=\\lambda(\\mathbf x) dx^1\\wedge\\cdots\\wedge dx^n$. Then,\n\\[ \\begin{align*} \\lambda(\\mathbf x) \u0026= \\varphi^\\ast\\omega\\left(\\frac{\\partial}{\\partial x^1},\\cdots, \\frac{\\partial}{\\partial x^n}\\right)\\\\ \u0026= |\\varepsilon^1\\wedge\\varepsilon^2\\wedge\\cdots\\wedge \\varepsilon^n|\\left(d\\varphi_{\\mathbf x}\\frac{\\partial}{\\partial x^1},\\cdots, d\\varphi_{\\mathbf x}\\frac{\\partial}{\\partial x^n}\\right)\\\\ \u0026= |\\varepsilon^1\\wedge\\varepsilon^2\\wedge\\cdots\\wedge \\varepsilon^n|\\left(\\,J_1^iE_i^L,\\,J_2^iE_i^L,\\cdots,\\, J_n^iE_i^L\\,\\right)\\\\ \u0026= \\left\\lvert\\det\\left(\\begin{bmatrix} J_1^1 \u0026 J_2^1 \u0026 \\cdots \u0026 J_n^1\\\\ J_1^2 \u0026 J_2^2 \u0026 \\cdots \u0026 J_n^2\\\\ \\vdots\u0026 \\vdots\u0026 \u0026\\vdots\\\\ J_1^n \u0026 J_2^n \u0026 \\cdots \u0026 J_n^n\\\\ \\end{bmatrix}\\right)\\right\\rvert \\\\ \u0026= |\\det(\\,J{\\small (\\mathbf x)}\\,)|, \\end{align*} \\] where we have omitted the argument $(\\mathbf x)$ of $J_i^j(\\mathbf x)$ for brevity. We can now integrate a function $f:D \\rightarrow \\mathbb R$ w.r.t. 
the Haar measure $\\omega$, by pulling the integral back to the parameter space:\n\\[ \\begin{align*} \\int_{D\\subseteq G} f \\cdot \\omega \u0026= \\int_{C\\subseteq \\mathbb R^n} \\varphi^\\ast f \\cdot\\varphi^\\ast \\omega \\\\\u0026= \\int_{C\\subseteq \\mathbb R^n} \\bar f \\cdot |\\det (J) |\\cdot dx^1\\wedge \\cdots \\wedge dx^n, \\end{align*} \\] where I use $`\\hspace{1pt}\\cdot\\hspace{1pt}'$ for clarity, to indicate the multiplication of a function with another function or $n$-form.\nThis is also what\u0026rsquo;s called a $1$- jet . This notion of an \u0026rsquo;equivalence class of curves\u0026rsquo; generalizes in confusing ways.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nHere, $\\dot \\gamma(0)$ refers to the pushforward of $\\frac{\\partial}{\\partial t}\\big\\vert_{t=0}$ under $\\gamma$.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nDon\u0026rsquo;t worry too much about which indices I place on top and which on the bottom. I just like the placement of the indices of a matrix to be consistent with those of its components.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nNote that Chirikjian writes $J$ as \u0026ldquo;$J_r$\u0026rdquo;, although I prefer to think of $J$ as the left Jacobian due to the involvement of left-invariant vector fields. Nevertheless, our calculations agree with his.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/lemmas/","summary":"Some results about the differentials (i.e., pushforwards) of the exponential and logarithm maps of a Lie group. I rely extensively on the interpretation of tangent vectors as equivalence classes of curves.","title":"Differentials in Lie Theory"},{"content":"Poincaré\u0026rsquo;s 1901 paper 1 introduces (in just a humble 3 pages) the Euler-Poincaré equations, which are the specialization of the Euler-Lagrange equations to the case where a Lie group $G$ acts on the manifold $\\mathcal Q$. 
In this case, vector fields on $\\mathcal Q$ can be expressed in terms of the infinitesimal actions of the group, and the Euler-Lagrange equations reduce to the Euler-Poincaré equations. In what follows, I work through Poincaré\u0026rsquo;s paper without making too many identifications .\nPaths in $\\mathcal Q$ Think of the space $\\mathcal Q$ as being the $n$-dimensional configuration space of some dynamical system. It is assumed that an $r$-dimensional Lie group $G$ acts transitively on $\\mathcal Q$, i.e., for any two points $q, q^\\prime \\in \\mathcal Q$, there exists a (not necessarily unique) $g\\in G$ such that $g\\cdot q = q^\\prime$. Given a basis $\\lbrace E_i \\rbrace_{i=1}^r$ for $\\mathfrak g$, the group action induces a set of vector fields $\\lbrace \\overline E_i\\rbrace_{i=1}^r$ on $\\mathcal Q$ which are sometimes called \u0026lsquo;infinitesimal generators\u0026rsquo;. These vector fields may be defined in terms of their actions on an arbitrary function $f\\in C^\\infty(\\mathcal Q)$:\n$$[\\overline E_i f](q) = \\lim_{\\epsilon\\rightarrow 0} \\frac{f(\\exp(\\epsilon E_i)\\cdot q) - f(q)}{\\epsilon}.$$We will also use $\\Phi(g, q) \\coloneq g\\cdot q$ and $\\Phi^{(q)}(g) \\coloneq \\Phi(g, q)$ to denote the group action. Then, $\\overline E_{i,q}$ is nothing but the pushforward vector $d\\Phi^{(q)}_e( E_i)$.\nLet the map $\\gamma:[0,1] \\rightarrow \\mathcal Q$ describe a path in $\\mathcal Q$ (e.g., the trajectory of a dynamical system whose state at time $t$ is $\\gamma(t)$), starting at time $t=0$ and ending at time $t=1$. 
Due to transitivity of the group action, we can express the \u0026ldquo;velocity\u0026rdquo; vectors of $\\gamma$ as follows:\n$$ \\begin{align*} \\dot\\gamma(s) \\coloneq d \\gamma_s \\left(\\frac{\\partial}{\\partial t}\\Big\\rvert_{t=s}\\right) \u0026= \\eta^i(s) \\overline E_i\\rvert_{\\gamma(s)} \\\\ \\end{align*} $$$\\in T_{\\gamma(s)}\\mathcal Q$, where $\\eta^i(s)$ are some coefficient/component functions that describe the velocity of $\\gamma$. Note that the coefficients $\\left\\lbrace \\eta^i\\right\\rbrace_{i=1}^r$ are not unique when $n \u003c r$, since $\\left\\lbrace \\overline E_{i,\\gamma(s)}\\right\\rbrace_{i=1}^r$ is then an over-complete spanning set (rather than a basis) for $T_{\\gamma(s)}\\mathcal Q$.\nVariation of $\\gamma$ Denote the space of paths on $\\mathcal Q$ as $\\mathcal P\\mathcal Q$, so that $\\gamma \\in \\mathcal P\\mathcal Q$. Consider a family of paths $\\Gamma(\\ \\cdot\\ , \\lambda)$ such that $\\Gamma(t, 0) = \\gamma(t)$. As $\\lambda$ is varied, we think of $\\Gamma(\\ \\cdot\\ , \\lambda)$ as a path in $\\mathcal P \\mathcal Q$, i.e., a path through path space! We know that the derivatives of such paths-of-paths at $\\lambda = 0$ should correspond bijectively to the tangent vectors in $T_{\\gamma}\\mathcal P \\mathcal Q$:\n$$ \\frac{\\partial}{\\partial\\lambda} \\Gamma (\\ \\cdot\\ , \\lambda)\\Big\\rvert_{\\lambda=0} \\in T_{\\gamma}\\mathcal P \\mathcal Q. $$ Technically, we should write $$ \\textrm{d}\\Gamma_{(t, 0)}\\left(\\frac{\\partial}{\\partial \\lambda}\\bigg\\rvert_{(t,0)}\\right)\\in T_{\\gamma(t)} \\mathcal Q $$ where $\\Gamma:[0,1]\\times[-\\epsilon,\\epsilon] \\rightarrow \\mathcal Q.$ By an abuse of notation, we will write $$\\frac{\\partial}{\\partial\\lambda}\\Gamma(t, \\lambda)\\Big|_{\\lambda = 0}$$ to refer to the foregoing object. 
For each $t\\in[0, 1]$, this is nothing but a tangent vector of $\\mathcal Q$:\n\\[ \\frac{\\partial}{\\partial\\lambda} \\Gamma (t, \\lambda)\\Big\\rvert_{\\lambda=0} \\in T_{\\gamma(t)}\\mathcal Q. \\] Thus, as we vary $t$, we obtain a vector field along $\\gamma$.\nAn element $\\delta\\hspace{-1.5pt}\\gamma\\hspace{1pt}\\in T_{\\gamma} \\mathcal P \\mathcal Q$ of this tangent space is commonly referred to as a variation of $\\gamma$.2 It is both a vector in $T_{\\gamma} \\mathcal P \\mathcal Q$ and a vector field (along $\\gamma$) on $\\mathcal Q$. Since it\u0026rsquo;s a vector field, we know that $\\delta\\hspace{-1.5pt}\\gamma$ can be expressed as\n\\[ \\delta\\hspace{-1.5pt}\\gamma(t) \\coloneq \\frac{\\partial}{\\partial \\lambda} \\Gamma(t, \\lambda)\\Big\\rvert_{\\lambda=0} =\\xi^i(t) \\overline E_{i,\\gamma(t)}. \\] Meanwhile, the $t$-derivative of $\\Gamma$ also defines a vector field:\n$$ \\dot\\Gamma(s,\\lambda) \\coloneq \\frac{\\partial}{\\partial t} \\Gamma(t, \\lambda)\\Big\\rvert_{t=s} = \\eta^i(s, \\lambda)\\overline E_{i,\\Gamma(s,\\lambda)}.$$Since $\\Gamma(\\hspace{1pt}\\cdot\\hspace{1pt},\\lambda)$ coincides with $\\gamma$ at $\\lambda=0$, we have the compatibility condition $\\eta^i(t,0)=\\eta^i(t)$.\n| Notation | Meaning |\n| --- | --- |\n| $\\Gamma(\\cdot, \\lambda)$ | perturbed version of $\\gamma$ |\n| $\\delta\\hspace{-1.5pt}\\gamma$ | direction in which $\\gamma$ is perturbed (i.e., variation of $\\gamma$) |\n| $\\dot\\Gamma(\\cdot, \\lambda)$ | velocity vector field of $\\Gamma(\\cdot, \\lambda)$ |\n| $\\left\\lbrace \\xi^i,\\eta^i \\right\\rbrace_{i=1}^r$ | coefficients of $\\delta\\hspace{-1.5pt}\\gamma$ and $\\dot\\Gamma$, respectively |\n| $\\dot{\\square}$, $\\delta\\hspace{0pt}\\square$ | derivatives of $\\square$ with respect to $t$ and $\\lambda$, respectively |\n
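To make this notation concrete, here is a minimal sketch of the simplest possible case, under the assumption (made only for illustration) that $\\mathcal Q=G=\\mathbb R^n$ acts on itself by translation, $g\\cdot q=q+g$, with $\\lbrace E_i\\rbrace_{i=1}^n$ the standard basis, so that $r=n$ and $\\overline E_i=\\partial/\\partial q^i$. Writing $\\gamma^i$ for the components of $\\gamma$ and taking the perturbed path to be $\\Gamma(t,\\lambda)=\\gamma(t)+\\lambda\\,\\xi^i(t)E_i$, one finds\n\\[ \\begin{align*} \\delta\\hspace{-1.5pt}\\gamma(t) \u0026= \\xi^i(t)\\,\\overline E_{i,\\gamma(t)},\\\\ \\dot\\Gamma(s,\\lambda) \u0026= \\big(\\dot\\gamma^i(s)+\\lambda\\,\\dot\\xi^i(s)\\big)\\,\\overline E_{i,\\Gamma(s,\\lambda)}, \\end{align*} \\] so that $\\eta^i(s,\\lambda)=\\dot\\gamma^i(s)+\\lambda\\,\\dot\\xi^i(s)$, which reduces to $\\eta^i(s)=\\dot\\gamma^i(s)$ at $\\lambda=0$, as the compatibility condition requires.\n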
Now for the tricky part \u0026ndash; any variation of $\\gamma$ induces a variation of $\\dot\\gamma$, which can be described by the functions $\\lbrace\\delta\\hspace{-1.5pt}\\eta^i\\rbrace_{i=1}^r$ defined as follows:\n$$ \\begin{align*} \\delta\\hspace{-1.5pt}\\eta^i(s) \u0026\\coloneq \\frac{\\partial}{\\partial\\lambda} \\eta^i(s, \\lambda)\\Big\\rvert_{\\lambda=0}. \\end{align*} $$As the variational principle requires us to pass from the $n$-dimensional space $\\mathcal Q$ to the $(n+r)$-dimensional space $\\mathcal Q \\times \\mathfrak g$, we need additional constraints that relate $\\delta\\hspace{-1.5pt}\\gamma$ to $\\lbrace\\delta\\hspace{-1.5pt}\\eta^i\\rbrace_{i=1}^r$, ensuring that the variations of $\\gamma$ and $\\dot \\gamma$ are compatible.3 To put it differently, notice that since $\\delta\\hspace{-1.5pt}\\gamma$ determines the perturbed curve $\\Gamma(\\hspace{1pt}\\cdot\\hspace{1pt}, \\lambda)$ for vanishingly small values of $\\lambda$, the velocity vectors of the perturbed curve are already determined by $\\delta\\hspace{-1.5pt}\\gamma$, meaning that we don\u0026rsquo;t have the freedom to also specify $\\lbrace\\delta\\hspace{-1.5pt}\\eta^i\\rbrace_{i=1}^r$ arbitrarily.\nAnnoyingly, Poincaré considers this matter a triviality. His (translated) paper reads \u0026ldquo;\u0026hellip; and [one] can easily find\u0026hellip;\u0026rdquo; before the result is presented. It is indeed a triviality when $G$ is abelian4, in which case we can use what Marsden \u0026amp; Ratiu like to call the equality of mixed partials: $\\frac{\\partial^2}{\\partial t \\partial \\lambda} = \\frac{\\partial^2}{\\partial \\lambda \\partial t}$. 
Much of what follows is a discussion of how this equality manifests in the non-abelian case, without resorting to the convenience of matrix Lie groups.\nVariation of $\\dot\\gamma$ The reader may want to specialize to matrix Lie groups and finish the argument algebraically, as done in Theorem 13.5.3 of Introduction to Mechanics and Symmetry by Marsden \u0026amp; Ratiu. Here, I will do a slightly more general version of the proof. What follows can be supplemented with Lee\u0026rsquo;s Introduction to Riemannian Manifolds (Lee IRM).\nConsider an affine torsion-free connection $\\nabla$ on $\\mathcal Q$. The curve $\\gamma(t)$ and the connection $\\nabla$ together define the covariant derivative operator, $D_t(\\hspace{1pt}\\cdot\\hspace{1pt})$. Using the Symmetry Lemma [Lemma 6.2, Lee IRM], we have\n$$ \\begin{align*} D_t\\left[\\frac{\\partial}{\\partial \\lambda}\\Gamma(t, \\lambda)\\Big|_{\\lambda=0}\\right](s) = D_{\\lambda} \\left[\\frac{\\partial}{\\partial t}\\Gamma(t, \\lambda)\\Big|_{t=s}\\right](0). \\end{align*} $$Either side of this equation is a vector at $\\Gamma(s,0)$. The idea is that we need to relate the left-hand side to $\\dot\\xi^i$ and the right-hand side to $\\delta\\hspace{-1.5pt}\\eta^i$. 
The left-hand side is5\n$$ \\begin{align*} D_t\\delta\\hspace{-1.5pt}\\gamma(s) \u0026= \\dot\\xi^i(s)\\overline E_{i,\\gamma(s)} + \\xi^i(s)\\big[\\nabla_{\\dot\\gamma(s)}\\overline E_{i}\\big]_{\\gamma(s)}\\\\ \u0026= \\dot\\xi^i(s)\\overline E_{i,\\gamma(s)} + \\xi^i(s)\\eta^j(s)\\big[\\nabla_{\\overline E_{j}}\\overline E_{i}\\big]_{\\gamma(s)}, \\end{align*} $$whereas the right-hand side is\n$$ \\begin{align*} D_{\\lambda} \u0026\\left[\\eta^i(s, \\lambda)\\overline E_{i,\\Gamma(s,\\lambda)}\\right](0) \\\\\u0026= \\frac{\\partial }{\\partial\\lambda} \\eta^i(s, \\lambda)\\Big|_{\\lambda=0}\\overline E_{i,\\Gamma(s,0)} + \\eta^i(s, 0) \\big[\\nabla_{\\delta\\hspace{-1.5pt}\\gamma(s)}\\overline E_{i}\\big]_{\\gamma(s)} \\\\\u0026= \\delta\\hspace{-1.5pt}\\eta^i(s)\\overline E_{i,\\gamma(s)} + \\eta^i(s)\\big[\\nabla_{\\delta\\hspace{-1.5pt}\\gamma(s)}\\overline E_{i}\\big]_{\\gamma(s)}\\\\\u0026= \\delta\\hspace{-1.5pt}\\eta^i(s)\\overline E_{i,\\gamma(s)} + \\xi^j(s)\\eta^i(s)\\big[\\nabla_{\\overline E_{j}}\\overline E_{i}\\big]_{\\gamma(s)} \\\\\u0026= \\delta\\hspace{-1.5pt}\\eta^i(s)\\overline E_{i,\\gamma(s)} \\\\\u0026\\qquad+ \\xi^j(s)\\eta^i(s)\\hspace{2pt}\\big(\\hspace{1pt}\\nabla_{\\overline E_{i}}\\overline E_{j} - [{\\overline E_{i}}, \\overline E_{j} ]\\big)_{\\gamma(s)}. \\end{align*} $$The last equality (as well as the symmetry lemma itself) follows from the connection being torsion-free. Note that $[{\\overline E_{i}}, \\overline E_{j} ] = L_{\\overline E_{i}} \\overline E_{j}$ is the Lie derivative; we will return to this point shortly. For now, we observe that the symmetry lemma yields\n$$ \\begin{align*} \\dot\\xi^i(s)\\overline E_{i,\\gamma(s)} \u0026= \\delta\\hspace{-1.5pt}\\eta^i(s)\\overline E_{i,\\gamma(s)} - \\eta^i(s)\\xi^j(s)[{\\overline E_{i,\\gamma(s)}}, \\overline E_{j,\\gamma(s)} ]. 
\\end{align*} $$Returning to $\\mathfrak g$ We are not quite done; the reader will notice that our expression appears to agree with Marsden and Ratiu\u0026rsquo;s, but has a sign-difference when compared to Poincaré\u0026rsquo;s. Actually, our equations are closer to Poincaré\u0026rsquo;s. The apparent discrepancy is due to the fact that the vector fields $\\lbrace \\overline E_i\\rbrace_{i=1}^r$ are more closely related to the right-invariant vector fields (RIVFs) on $G$ than they are to the left-invariant vector fields (LIVFs). In particular, there is a Lie algebra anti-homomorphism: $[{\\overline E_i, \\overline E_j}] = -\\overline{[ E_i, E_j]}$ (proven in the appendix), since the usual Lie bracket on $\\mathfrak g$ is also defined via LIVFs. Our urge to make everything \u0026ldquo;act from the left\u0026rdquo; in mathematical notation has led us to consider left group actions on $\\mathcal Q$, and I suppose the same urge has made LIVFs the predominant choice on $G$.\nLet $\\xi(s) \\coloneq \\xi^i(s) E_i$, $\\eta(s) \\coloneq \\eta^i(s) E_i$, and $\\delta\\hspace{-1.5pt}\\eta(s) \\coloneq \\delta\\hspace{-1.5pt}\\eta^i(s) E_i$ be curves in $\\mathfrak g$. Note that $\\overline{\\small\\dot\\xi(s)} = \\dot\\xi^i(s)\\overline E_{i,\\gamma(s)}$, and that $\\eta^i(s)\\xi^j(s)[{\\overline E_{i}}, \\overline E_{j} ]=[\\hspace{1pt}\\overline{\\small\\eta},\\overline{\\small\\xi}\\hspace{1pt}]$. We have,\n$$ \\begin{align*} \\overline{\\small\\dot\\xi} = \\overline{\\delta\\hspace{-1.5pt}\\eta}- \\big[\\hspace{2pt}\\overline{\\small \\eta}\\hspace{0.5pt},\\hspace{1pt}\u0026\\overline{\\small \\xi}\\hspace{2pt}\\big]= \\overline{\\delta\\hspace{-1.5pt}\\eta} + \\overline{[\\hspace{1.5pt}{\\small\\eta}\\hspace{0.5pt},\\hspace{1pt} {\\small \\xi\\hspace{1.5pt}}]}\\\\ \u0026\\Downarrow \\\\ \\dot\\xi={\\delta\\hspace{-1.5pt}\\eta} \u0026+ {[\\hspace{1.5pt}{\\small\\eta}\\hspace{0.5pt},\\hspace{1pt} {\\small \\xi\\hspace{1.5pt}}]}. 
\\end{align*} $$Finally, we have something that agrees with Poincaré\u0026rsquo;s note. It agrees with Marsden \u0026amp; Ratiu\u0026rsquo;s too, but corresponds to the right-invariant version of their result, which in their book is left as an exercise to the reader. In the proof presented by M \u0026amp; R, they consider the right group action of $G$ on $\\mathcal Q$, whereas we have considered a left action. Consequently, M \u0026amp; R describe $\\dot\\Gamma$ and $\\delta\\hspace{-1.5pt}\\gamma$ using their left-invariant velocities, while we had to work with their right-invariant counterparts.\nIn the basis we chose for $\\mathfrak g$, we can compute the structure constants $\\lbrace c_{ij}^k\\rbrace_{i,j,k=1}^r$, defined by $[E_j,E_k]=c_{jk}^iE_i$, after which we can write\n$$ \\dot\\xi^i = {\\delta\\hspace{-1.5pt}\\eta}^i \\ +\\ \\eta^{\\hspace{0.5pt}j}\\, \\xi^k \\,c_{jk}^i. $$Euler-Poincaré Equations Let $\\mathscr L:T\\mathcal Q \\rightarrow \\mathbb R$ be a smooth function, called the Lagrangian (recall that a Hamiltonian $\\mathscr H$ is instead a function on $T^\\ast \\mathcal Q$). As before, $\\eta(t)= \\eta^i(t)E_i$. Given a tuple $(\\gamma(t),\\eta(t))$, we can map it to a unique point in $T\\mathcal Q$: $$ (\\gamma(t),\\eta(t))\\mapsto\\big(\\gamma(t),\\overline{\\eta(t)}_{\\gamma(t)}\\big) $$ (this is not precisely the pushforward map of $\\Phi$, but it\u0026rsquo;s very close!) Consequently, we can pull back $\\mathscr L$ to define a function on $\\mathcal Q \\times \\mathfrak g$, denoted as $\\mathscr L'$. The function $\\mathscr L'$ will be invariant under the transformations of $\\eta$ that leave $\\overline{\\eta}$ unchanged.\nNow, $\\mathscr L'$ can be further pulled back under $\\tilde \\gamma : t\\mapsto(\\gamma(t),\\eta(t))$ to define $\\mathscr L' \\circ \\tilde\\gamma$, a function on $[0,1]$ that can be integrated! 
Consider the mapping $\\mathscr A:\\mathcal P(\\mathcal Q\\times \\mathfrak g) \\rightarrow \\mathbb R$ defined by\n$$ \\begin{align} \\mathscr A(\\tilde\\gamma) \u0026= \\int_0^1 \\mathscr L' \\circ \\tilde\\gamma\\,(t)\\hspace{1pt} dt = \\int_0^1 \\mathscr L'\\big(\\gamma(t), \\eta(t)\\big)\\hspace{1pt} dt, \\end{align} $$called the action functional.6 Even before we consider the variation in $\\gamma$, we already see that $\\gamma(t)$ and $\\eta(t)$ should satisfy a constraint, namely that $\\dot\\gamma(t)=\\eta^i(t)\\overline{E_i}_{\\gamma(t)}$. In the classical derivation, this was replaced by the \u0026ldquo;identity\u0026rdquo; $\\dot q(t) = \\frac{d}{dt} q(t)$.\n$$ \\require{amscd} \\begin{CD} \\mathcal Q \\times \\mathfrak g @\u003e{(p,\\, X) \\mapsto (p,\\,\\overline{X}_p)}\u003e\u003e T\\mathcal Q \\\\ @A{\\tilde\\gamma}AA @VV{\\mathscr L}V \\\\ [0,1] @\u003e{{\\mathscr L}'\\circ\\tilde\\gamma}\u003e\u003e \\mathbb{R} \\end{CD} $$The variations in $\\gamma$ and $\\eta$, will introduce another constraint. That is, $\\delta\\hspace{-1pt}\\eta$ should be compatible with $\\delta\\hspace{-1pt}\\gamma$ (as was made precise in the preceding section). These variations will together induce a variation in $\\mathscr L'$, and consequently in $\\mathscr A$, the last of which we will set to $0$. I will write this using coordinates (see Appendix A) because (i) we can, and (ii) I don\u0026rsquo;t know how to do this in a coordinate-free way (yet):\n$$ \\begin{align} {d\\mathscr A}_\\gamma \\big((\\delta\\hspace{-1pt}\\gamma, \\delta\\hspace{-1.5pt}\\eta)\\big) \u0026= \\int_0^1 \\left[\\frac{\\partial \\mathscr L'}{\\partial\\,\\gamma^i\\,}\\delta\\hspace{-1pt}\\gamma^i + \\frac{\\partial \\mathscr L'}{\\partial\\,\\eta^i\\,}\\delta\\hspace{-1pt}\\eta^i\\right]\\big(\\tilde\\gamma(t)\\big)\\hspace{1pt} dt =0. 
\\end{align} $$The object ${d\\mathscr A}_\\gamma $ is written by Marsden \u0026amp; Ratiu as $\\frac{\\delta \\mathscr A}{\\delta \\gamma\\ }$ \u0026ndash; it is the exterior derivative of $\\mathscr A$. Using the compatibility condition from the previous section in the form ${\\delta\\hspace{-1.5pt}\\eta}^i=\\dot\\xi^i+\\eta^{k}\\xi^j c_{jk}^i$ (which uses the antisymmetry of the structure constants), as well as App. A, we have\n$$ \\begin{align} \u0026\\int_0^1 \\left[\\frac{\\partial \\mathscr L'}{\\partial\\,\\gamma^i\\,}\\xi^j \\overline E_j^i + \\frac{\\partial \\mathscr L'}{\\partial\\,\\eta^j\\,}\\dot\\xi^j \\ +\\ \\frac{\\partial \\mathscr L'}{\\partial\\,\\eta^i\\,}\\eta^{k} \\xi^j c_{jk}^i\\right]\\hspace{1pt} dt\\\\ \u0026=\\int_0^1 \\xi^j \\left[\\frac{\\partial \\mathscr L'}{\\partial\\,\\gamma^i\\,} \\overline E_j^i - \\frac{d}{dt}\\left(\\frac{\\partial \\mathscr L'}{\\partial\\,\\eta^j\\,}\\right)+ \\frac{\\partial \\mathscr L'}{\\partial\\,\\eta^i\\,} \\eta^{k}c_{jk}^i\\right]\\hspace{1pt} dt\\\\ \u0026\\qquad\\qquad\\qquad +\\int_0^1 \\frac{d}{dt}\\left(\\frac{\\partial \\mathscr L'}{\\partial\\,\\eta^j\\,} \\xi^j\\right) dt =0. \\end{align} $$But the fundamental theorem of calculus tells us that\n$$ \\int_0^1 \\frac{d}{dt}\\left(\\frac{\\partial \\mathscr L'}{\\partial\\,\\eta^j\\,} \\xi^j\\right) dt = \\left[\\frac{\\partial \\mathscr L'}{\\partial\\,\\eta^j\\,} \\xi^j\\right]_{t=0}^{t=1}. $$For variations that fix the endpoints (I explain in the footnotes why we need this constraint), the above term is $0$, and we can localize the integral to obtain the Euler-Poincaré equations:\n$$ \\begin{align} \\frac{d}{dt}\\left(\\frac{\\partial \\mathscr L'}{\\partial\\,\\eta^j\\,}\\right) = \\frac{\\partial \\mathscr L'}{\\partial\\,\\eta^i\\,} \\eta^{k}c_{jk}^i + \\frac{\\partial \\mathscr L'}{\\partial\\,\\gamma^i\\,} \\overline E_j^i . \\end{align} $$And there it is, une forme nouvelle des équations de la mécanique. 
As Poincaré points out, this is especially of interest when $\\mathscr L'$ only depends on $\\eta$ (e.g., when computing geodesic motion).\nAppendices A. Computation in Coordinates Let $\\lbrace q^i\\rbrace_{i=1}^n$ be coordinates on a subset of $\\mathcal Q$. We can express $\\overline E_i$ in terms of the coordinate frame $\\lbrace {\\partial}/{\\partial q^i}\\rbrace_{i=1}^n$, as $$\\overline E_i = \\overline E_i^j \\frac{\\partial}{\\partial q^j},$$ where each $\\overline E_i^j \\in C^\\infty(\\mathcal Q)$ is a coordinate function. Letting $\\overline E_i$ act on the coordinate function $q^j$, we get\n$$ \\overline E_i q^j = \\overline E_i^k \\frac{\\partial q^j}{\\partial q^k} = \\overline E_i^j, $$which tells us how to compute the coefficients of $\\overline E_i$ \u0026ndash; just feed it the coordinate functions. Returning to the definition of $\\dot\\gamma(s)$, we get:\n$$\\dot\\gamma(s) = \\eta^i(s) \\overline E_i^j(\\gamma(s)) \\frac{\\partial}{\\partial q^j}\\Big\\vert_{\\gamma(s)}.$$Letting the left-hand side act on the coordinate function $q^k$, we get\n$$ \\begin{align*} \\big(\\dot\\gamma(s)\\big) (q^k) \u0026= d \\gamma_s \\left(\\frac{\\partial}{\\partial t}\\Big\\rvert_{t=s}\\right)(q^k)=\\frac{d}{dt}(q^k\\circ \\gamma)(t)\\Big|_{t=s}, \\end{align*} $$whereas letting the right-hand side eat $q^k$, we get\n$$ \\begin{align*} \\eta^i(s) \\overline E_i^j(\\gamma(s)) \\frac{\\partial q^k}{\\partial q^j}\\Big\\vert_{\\gamma(s)} = \\eta^i(s) \\overline E_i^k(\\gamma(s)). \\end{align*} $$Here, $i$ sums from $1$ to $r$ whereas $j$ sums (and $k$ ranges) from $1$ to $n$. Also, observe that $r\\geq n$ due to transitivity of the group action. Denoting $q^j \\circ \\gamma$ as $\\gamma^j$, we can finally put everything together:\n$$ \\frac{d \\gamma^k}{dt}(s) = \\eta^i(s) \\overline E_{i,\\gamma(s)} \\gamma^k(s). $$This expression can be found in Poincaré\u0026rsquo;s paper (linked at the beginning of the post). 
Similarly, we have\n$$ \\begin{align*} \\delta\\hspace{-1.5pt}\\gamma(s) =\\delta\\hspace{-1.5pt}\\gamma^i(s)\\frac{\\partial}{\\partial q^i}\\Big\\vert_{\\gamma(s)} = \\xi^j(s) \\overline E_j^i \\frac{\\partial}{\\partial q^i}\\Big\\vert_{\\gamma(s)}. \\end{align*} $$B. Computation using Matrix Algebra We can choose $\\Gamma$ to be\n$$\\Gamma(t, \\lambda) = \\exp(\\lambda \\xi^i(t) E_i)\\cdot\\gamma(t)$$since it satisfies the conditions for being a representative of $\\delta\\hspace{-1.5pt}\\gamma$. Actually, we can do something more; we can express $\\gamma(t)$ as $g(t) \\cdot p$, where $p\\in \\mathcal Q$ is some distinguished point (that may be called the \u0026ldquo;origin\u0026rdquo; of $\\mathcal Q$) and $g(t)$ is a curve of actions in $G$. This means that\n$$\\Gamma(t, \\lambda) = \\exp(\\lambda \\xi^i(t) E_i)g(t)\\cdot p.$$The curve $g$ that generates $\\gamma$ in this manner is not unique even after we have fixed some point $p$. To see why, one can consider (the rather silly example of) $G=\\mathbb R^3 \\times \\mathbb R^3$ and $\\mathcal Q = \\mathbb R^3$. Nevertheless, since we are going to probe all possible variations of $\\gamma$, we only need to worry about the \u0026ldquo;surjectivity\u0026rdquo; of this formulation, rather than its \u0026ldquo;injectivity\u0026rdquo;. Surjectivity follows from our assumption of the group action being transitive.\nIf $G$ is a matrix Lie group, the expression\n\\[ \\begin{align*} \\frac{d}{d t}\u0026\\left(\\left( \\frac{\\partial}{\\partial \\lambda} \\Gamma(t, \\lambda)\\Big\\rvert_{\\lambda=0}\\right)g(t)^{-1}\\right) \\end{align*} \\] can be viewed through a purely matrix-algebraic lens. This is what Marsden and Ratiu do in their book, so I leave the remaining details to them.\nC. Proof of $[{\\overline X, \\overline Y}] = -\\overline{[ X, Y]}$ Let $\\bar L^{(g)} (q) \\coloneq \\Phi(g,q) = \\Phi^{(q)}(g)= g\\cdot q$. 
We will reuse this notation for left and right multiplication on $G$, so that $L^{(g)}(h)=R^{(h)}(g) = gh$. The following equalities hold:\n$$ \\begin{align*} g\\cdot(h\\cdot q)\u0026=gh\\cdot q\\\\ \\quad\\bar L^{(g)}\\circ \\bar L^{(h)} \u0026= \\bar L^{(gh)}\\\\ \\bar L^{(g)} \\circ \\Phi^{(q)} \u0026= \\Phi^{(q)}\\circ L^{(g)}\\ \\ \\\\ \\Phi^{(h\\cdot q)} \u0026= \\Phi^{(q)}\\circ R^{(h)} \\end{align*} $$ As a mnemonic, we can think of $\\Phi^{(q)}$ as \u0026ldquo;right-multiplication by $q$\u0026rdquo;, so that it \u0026ldquo;commutes\u0026rdquo; with $\\bar L^{(g)}$. Next, we need to demonstrate the fact that $\\overline{\\mathrm{Ad}_g Y}={\\bar L^{(g^{-1})}}^\\ast\\hspace{1pt} \\overline Y$. The vector field on the left has, at $q\\in\\mathcal Q$, the value\n$$ \\begin{align*} d\\Phi^{(q)}_e (\\mathrm{Ad}_g Y) \u0026= d\\Phi^{(q)}_e dR^{(g^{-1})}_{g} dL^{(g)}_e Y\\\\ \u0026= d\\Phi^{(g^{-1}\\cdot \\hspace{1pt}q)}_{g} dL^{(g)}_e Y. \\end{align*} $$The corresponding vector on the right is\n$$ ({L^{(g^{-1})}}^\\ast\\hspace{1pt} \\overline Y)_q = (L^{(g)}_\\ast\\hspace{1pt} \\overline Y)_q = {(d \\bar L_{g})}_{g^{-1}\\cdot \\hspace{1pt}q} d\\Phi_{e}^{(g^{-1} \\cdot\\hspace{1pt} q)} Y, $$and the \u0026ldquo;commutativity\u0026rdquo; of $\\bar L^{(g)}$ and $\\Phi^{(q)}$ completes the argument.\nAs I hinted at previously, the fact that we have chosen to work with a left group action (rather confusingly) makes it so that the induced VFs on $\\mathcal Q$ are closely related to the RIVFs on $G$. If $X^L$ and $X^R$ are the LIVF and RIVF of $X$, then $(\\mathrm{Ad}_gX)^R = X^L$. The homogeneous space version of this result is precisely what we have shown. 
Next, choose $g=\\exp(tX)$ and differentiate w.r.t $t$ (i.e., evaluate the pushforward of $\\frac{\\partial}{\\partial t}$):\n$$ \\begin{align*} \\overline{\\frac{d}{dt}\\mathrm{Ad}_{\\exp(t X)} Y\\Big\\rvert_{t=0}} \u0026= \\frac{d}{dt} {\\bar L^{(\\exp(-tX))}}^\\ast \\overline Y\\Big\\rvert_{t=0}\\\\ \\overline{\\mathrm{ad}_X Y} \u0026= -\\frac{d}{dt} {\\bar L^{(\\exp(tX))}}^\\ast \\overline Y\\Big\\rvert_{t=0}\\\\ \\overline{[X,Y]} \u0026= -[\\overline{X}, \\overline{Y}]. \\end{align*} $$The last line follows from the fact that $\\bar L^{(\\exp(tX))}$ is the flow of $\\overline{X}$, and that the Lie bracket of vector fields is (by definition) the Lie derivative.\nI\u0026rsquo;m grateful to my colleague Jöel Bensoam for introducing me to this paper (and to variational calculus)!\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n$\\Gamma$ is in fact a representative of the equivalence class defined by $\\delta\\hspace{-1.5pt}\\gamma$. Moreover, note that since the variation lives in a tangent space of $\\mathcal P \\mathcal Q$, it should act on an object of the type $f:\\mathcal P\\mathcal Q \\rightarrow \\mathbb R$ (called a functional \u0026ndash; a function that eats a path and spits out a number). This action is defined as follows: $$\\delta\\hspace{-1.5pt}\\gamma\\hspace{1pt}(f) = \\frac{\\partial}{\\partial\\lambda} f\\left(\\Gamma(\\ \\cdot\\ , \\lambda)\\right)\\Big\\rvert_{\\lambda=0} = \\mathbf D f(\\gamma) \\cdot \\delta\\hspace{-1.5pt}\\gamma,$$ wherein the last piece of notation is explained in Marsden and Ratiu\u0026rsquo;s books, as well as in my other post .\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nAn example showing the importance of constraints is as follows. We can use variational calculus to show that the shortest path between two points $\\mathbf p,\\mathbf p^\\prime \\in \\mathbb R^n$ is a straight line. 
Since perturbations of the straight line $\\gamma(t)= \\mathbf p + t(\\mathbf p^\\prime-\\mathbf p)$ that keep $\\gamma(0)$ and $\\gamma(1)$ fixed can only increase the length, we conclude that the straight line is indeed the shortest path. However, if we drop the constraint on $\\gamma(0)$ and $\\gamma(1)$, then there exists a perturbation that moves the endpoints closer together and thereby shortens the path. Basically, we need to impose the constraints $\\delta\\hspace{-1.5pt}\\gamma(0) = 0$ and $\\delta\\hspace{-1.5pt}\\gamma(1)=0$ to properly formulate the geometric problem we have in mind.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nI recommend that the reader go to Chapter 1 of Mechanics \u0026amp; Symmetry and identify the point in the proof of the Euler-Poincaré equations where the relationship between $\\delta\\hspace{-1.5pt}\\gamma$ and $\\delta\\hspace{-1.5pt}\\eta^i$ is used. There, $\\gamma(t)$ is described by $q^i(t)$ and $\\dot\\gamma(t)$ by $\\dot q^i(t)$.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThe formula for the covariant derivative of a vector field along $\\gamma$ is given in Lee IRM. Instead of writing $\\nabla_{\\dot\\gamma(t)}\\overline E_{i,\\gamma(t)}$, we should properly extend either vector field to an open set of $\\mathcal Q$ before evaluating the derivative. We will ignore this technicality to economize on our notation.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nNote that the (translation-invariant) integration measure on $[0,1]$ is unique up to scaling; choosing a scaling is like choosing how fast (and in which direction!) time flows, and does not influence the minimizer. 
Secondly, the definition of $\\mathscr A$ relies on the fact that we can canonically lift $\\gamma$ to define $\\tilde\\gamma$ (such a lifting doesn\u0026rsquo;t come from a choice of connection on $T\\mathcal Q$).\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/ep/","summary":"Poincaré\u0026rsquo;s 1901 paper introduces (in just a humble 3 pages) the Euler-Poincaré equations, which are the specialization of the Euler-Lagrange equations to the case where a Lie group acts on the configuration manifold. I work through Poincaré\u0026rsquo;s paper without making too many identifications.","title":"Euler-Poincaré Equations"},{"content":"The Fourier transform in $\\mathbb R^n$ maps a complex-valued function $f:\\mathbb R^n \\rightarrow \\mathbb C$ to a function of its Pontryagin dual space (which is in this case also $\\mathbb R^n$), $\\ \\hat f:\\mathbb R^n \\rightarrow \\mathbb C$, given by\n$$ \\begin{align*} \\hat f\u0026(\\boldsymbol \\omega) \\\\\u0026\\coloneqq \\int_{\\mathbb R^n} f(\\mathbf x) e^{-i\\hspace{1pt} 2\\pi \\langle \\boldsymbol \\omega, \\mathbf x\\rangle} d\\mathbf x\\\\ \u0026= \\int_{\\mathbb R}\\cdots\\int_{\\mathbb R} f(\\mathbf x) \\left(e^{-i \\hspace{1pt}2\\pi \\omega^1 x^1}\\cdots e^{-i \\hspace{1pt}2\\pi \\omega^n x^n}\\right) dx^1 \\cdots dx^n\\\\ \u0026= \\cdots \\left(\\int_{\\mathbb R}\\left(\\int_{\\mathbb R} f(\\mathbf x) e^{-i\\hspace{1pt}2\\pi \\omega^1 x^1}dx^1 \\right) e^{-i\\hspace{1pt}2\\pi \\omega^2 x^2} dx^2 \\right)\\cdots \\end{align*} $$That is, it performs the Fourier transform (FT) along each dimension, in succession. On compact quotients of $\\mathbb R^n$, like $\\mathbb T^n \\cong \\mathbb R^n/\\mathbb Z^n$ (where each $\\mathbb T\\cong S^1$ is the one-dimensional torus ), this becomes a Fourier series .\nDiscretization \u0026amp; Computation Since we know that the $n$-dimensional FT is really just the one-dimensional FT iterated, let\u0026rsquo;s restrict to the latter. 
Consider a function $f: \\mathbb R \\rightarrow \\mathbb C$ that has been sampled on a regularly-spaced lattice in $\\mathbb R$, i.e., a set of points $(x_j)_{j\\in \\mathbb Z}$ with $\\Delta \\coloneqq(x_j - x_{j-1})$ denoting the \u0026ldquo;sampling period\u0026rdquo;. For simplicity, let $x_0 = 0$ so that $x_j = j \\Delta$. The Fourier transform integral can be discretized such that the resulting summation only depends on the information we have access to:\n$$ \\tilde f(\\omega) \\coloneqq \\sum_{j\\in \\mathbb Z} f(x_j) e^{-i\\hspace{1pt}2\\pi \\omega x_j} \\Delta \\approx \\int_{\\mathbb R} f(x) e^{-i\\hspace{1pt}2\\pi \\omega x}dx = \\hat f(\\omega). $$Now,\n$$ \\begin{align*} \\tilde f(\\omega + \\tfrac{1}{\\Delta}) \u0026= \\sum_{j\\in \\mathbb Z} f(x_j) e^{-i\\hspace{1pt}2\\pi \\left(\\omega + \\tfrac{1}{\\Delta} \\right)j\\Delta} \\Delta\\\\ \u0026= \\sum_{j\\in \\mathbb Z} f(x_j) e^{-i\\hspace{1pt}2\\pi \\omega j \\Delta} \\Delta \\\\ \u0026= \\tilde f(\\omega), \\end{align*} $$showing that $\\tilde f$ is periodic with period $\\tfrac{1}{\\Delta}$. The transform $f \\mapsto \\tilde f$ is called the Discrete-Time Fourier Transform (DTFT). Because $\\tilde f$ is periodic, the inverse transform integrates over a single period (an interval of length $\\tfrac{1}{\\Delta}$) rather than over all of $\\mathbb R$.\nIf we have only a finite number of samples $(x_i)_{i=0}^{N-1}$, then we obtain the (rather confusingly named) Discrete Fourier Transform (DFT). To do this, we assume that the samples are periodic with period $N$; either this periodicity is due to the underlying periodicity of $f$, or we naively let the function be periodic outside $[0, N\\Delta] \\subseteq \\mathbb R$ because we do not care about the behavior of $f$ outside of this domain. 
In either case,\n$$ \\begin{align} \\tilde f(\\omega) \u0026= \\sum_{j\\in \\mathbb Z} f(x_j) e^{-i\\hspace{1pt}2\\pi \\omega j \\Delta} \\Delta \\\\ \u0026= \\sum_{k\\in \\mathbb Z}\\sum_{j=0}^{N-1} f(x_j) e^{-i\\hspace{1pt}2\\pi \\omega (k N + j) \\Delta} \\Delta\\\\ \u0026= \\sum_{j=0}^{N-1} f(x_j) \\left(\\sum_{k\\in \\mathbb Z}e^{-i\\hspace{1pt}2\\pi \\omega k N \\Delta}\\right) e^{-i\\hspace{1pt}2\\pi \\omega j \\Delta}\\Delta \\end{align} $$Using the (rather mysterious) Poisson summation formula, we get\n$$ \\begin{align} \\tilde f(\\omega) \u0026= \\sum_{j=0}^{N-1} f(x_j) \\left(\\sum_{k\\in \\mathbb Z} \\delta(\\omega N \\Delta + k) \\right) e^{-i\\hspace{1pt}2\\pi \\omega j \\Delta}\\Delta \\end{align} $$Thus, the DTFT of a periodic sequence can be non-zero only when $\\omega N \\Delta$ is an integer, i.e., $\\omega = \\ell/N\\Delta$ for some $\\ell \\in\\mathbb Z$. Since\n$$\\tilde f\\left(\\frac{\\ell + N}{N\\Delta}\\right) = \\tilde f\\left(\\frac{\\ell}{N\\Delta} + \\frac{1}{\\Delta}\\right) = \\tilde f\\left(\\frac{\\ell}{N\\Delta}\\right),$$it is $N$-periodic in the Fourier domain; we only need to compute $N$ coefficients to realize this transform. In summary, the DFT can be thought of as \u0026ldquo;the DTFT of a finite sequence of samples\u0026rdquo;, as computed under the assumption that the function is periodic outside the domain of the observed samples. In turn, the DTFT is simply a discretization of the FT. I present this table from Wikipedia:\nTime/Spatial Domain | Frequency Domain | Transform Name\nDiscrete / Periodic | Discrete / Periodic | Discrete Fourier Transform (DFT)\nContinuous / Periodic | Discrete / Aperiodic | Fourier Series\nDiscrete / Aperiodic | Continuous / Periodic | Discrete-Time Fourier Transform (DTFT)\nContinuous / Aperiodic | Continuous / Aperiodic | Fourier Transform (FT)\nFinally, the Fast Fourier Transform (FFT) is just a way to compute the DFT efficiently by exploiting the underlying redundancy of the computations. 
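To make this redundancy concrete, here is a sketch of the standard radix-2 (Cooley-Tukey) splitting, under the assumption that $N$ is even; the shorthands $f_j \\coloneqq f(x_j)$ and $F_\\ell$ (the $\\ell$-th DFT sum with the factor $\\Delta$ dropped) are introduced only for this illustration: $$ F_\\ell \\coloneqq \\sum_{j=0}^{N-1} f_j\\, e^{-i\\hspace{1pt}2\\pi \\ell j/N} = \\sum_{m=0}^{N/2-1} f_{2m}\\, e^{-i\\hspace{1pt}2\\pi \\ell m/(N/2)} + e^{-i\\hspace{1pt}2\\pi \\ell/N} \\sum_{m=0}^{N/2-1} f_{2m+1}\\, e^{-i\\hspace{1pt}2\\pi \\ell m/(N/2)}. $$ Each sum on the right is a DFT of length $N/2$ that is $(N/2)$-periodic in $\\ell$, so it can be reused for both $\\ell$ and $\\ell + N/2$; applying the split recursively (when $N$ is a power of $2$) brings the cost down from roughly $N^2$ to roughly $N\\log_2 N$ operations. 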
MATLAB has a function for computing the $n$-dimensional FFT, whose documentation mentions that this is mathematically just the one-dimensional FFT iterated.\nThe Compact, Non-Abelian Case So we know how the FT works on locally compact (which essentially means noncompact) abelian groups. Now, the theory of Pontryagin duality breaks down on non-abelian groups.1 We can nonetheless ask whether a function $f:G \\rightarrow \\mathbb C$ may be transformed, without loss of information, into another function $\\hat f: \\hat G \\rightarrow \\mathbb C$, where $\\hat G$ is some sort of a dual space. The answer is yes, but the details are intricate.\nInstead of things like $e^{-i \\omega x}$, one needs to introduce the so-called irreducible unitary representations (IURs) of $G$ (in the abelian case, these are called characters and constitute a group). A representation $U:G \\rightarrow GL(V)$ is a group homomorphism from $G$ to the invertible linear operators on some vector space $V$. It is said to be unitary if ($V$ comes with an inner product, and) $U(g)$ is a unitary operator for all $g \\in G$. We won\u0026rsquo;t worry about what irreducible means here, but it is a crucial ingredient in this theory. Note that $U(g^{-1}) = U(g)^{-1} = U(g)^{\\dagger}$. For instance, $e^{i\\omega x}$ is an IUR of $\\mathbb R$, whose complex conjugate is $e^{-i\\omega x}$. For $SU(2)$, the IURs act on homogeneous polynomials (which can be given a vector space structure). A subset of the $SU(2)$ IURs become the IURs for $SO(3)$ (see Unitary Representations by Mitsuo Sugiura, Thm. 7.1).\nThe dual space $\\hat G$ is the space of all (equivalence classes of) IURs of $G$ (the details are outside of our scope). Let\u0026rsquo;s say that the equivalence classes of IURs can be parameterized by some $\\lambda$, so that $\\big(\\hspace{1pt} U^\\lambda \\mathrel\\vert \\lambda \\in \\hat G \\big)$ represents all the distinct IURs (i.e., does not double-count any that are equivalent). 
The Fourier transform of $f$ (square-integrable function on $G$) is then given by2\n$$\\hat f(\\lambda) = \\int_G f(g) U^\\lambda(g^{-1}) dg= \\int_G f(g) \\left(U^\\lambda(g)\\right)^{\\dagger} dg,$$where $dg$ is the Haar measure on $G$ and $\\hat f(\\lambda)$ is an operator. If we choose a basis $(\\mathbf e_i)_{i=1}^{\\textrm{dim}(V^\\lambda)}$ for $V^\\lambda$ (i.e., the vector space that $U^\\lambda$ acts on), then we can compute the $(i,j)^{th}$ component of $\\hat f(\\lambda)$ as\n$$\\hat f(\\lambda)_{ij} = \\int_G f(g) \\left[\\left(U^\\lambda(g)\\right)^{\\dagger}\\right]_{ij} dg. $$The inverse Fourier transform (which hinges on the Peter-Weyl theorem) is given by (Thm. 8.1 of Sugiura)\n$$ \\begin{align} f(g) \u0026= \\sum_{\\lambda \\in \\hat G} \\textrm{dim}(V^\\lambda) \\left(\\sum_{i,j=1}^{\\textrm{dim}(V^\\lambda)} \\hat f(\\lambda) _{ji}U^\\lambda(g)_{ij}\\right)\\\\ \u0026= \\sum_{\\lambda \\in \\hat G} \\textrm{dim}(V^\\lambda) \\,\\mathrm{Trace}\\left( \\hat f(\\lambda)U^\\lambda(g)\\right). \\end{align} $$In the case of $SO(3)$, the IURs $U^\\lambda(g)$ are the Wigner $D$-matrices, $\\lambda$ ranges over the nonnegative integers, and $\\textrm{dim}(V^\\lambda) = 2\\lambda + 1$. As in the case of the classical Fourier transform, compactness of the original domain implies that the IUR-space (i.e., frequency domain) is discrete.\nAny representation of the group yields a corresponding representation of the Lie algebra:\n$$ u^\\lambda(X) \\coloneqq \\frac{d}{dt}U^\\lambda\\left(\\exp(tX)\\right)\\bigg\\vert_{t=0}. $$This object gives a commutative diagram for the actions of the left-invariant vector fields (LIVFs) on $f$. If $X^L$ is the LIVF generated by $X\\in \\mathfrak g$, then\n$$ \\widehat{X^L f}(\\lambda) = -u^\\lambda(X)\\hat f(\\lambda), $$$$ \\widehat{X^R f}(\\lambda) = -\\hat f(\\lambda)u^\\lambda(X). 
$$Other classical results, like the Parseval/Plancherel identity and the convolution theorem, hold, but the convolution is generally non-commutative.\nThe Locally Compact, Non-Abelian Case The theory gets more complicated here, and the literature sparse. The IURs are parametrized by the positive real numbers (i.e., $\\lambda \u003e0$). As we are no longer on a compact domain, the space of (equivalence classes of) IURs is continuous.\nThe IURs of $SE(2)$ act on the $L^2$ space of functions on the circle; or rather, we will think of these as functions on $SO(2)$. Letting $\\phi \\in SO(2)$ be a point and $\\zeta\\in L^2(SO(2))$ a function on the circle, respectively, we define\n$$ \\left[U^\\lambda (g) \\zeta\\right](\\phi)\\coloneqq e^{-i \\lambda \\hspace{1pt}(\\hspace{1pt}{\\mathbf t}_x\\cos \\phi \\hspace{1pt}+\\hspace{1pt} {\\mathbf t}_y \\sin \\phi)} \\zeta(\\phi - \\theta) $$where $g\\coloneq(\\theta, \\mathbf t) \\in SE(2)$ is the element being represented. Here, $U^\\lambda(g) \\zeta$ represents the action of $U^\\lambda(g)$ on $\\zeta$, whose result is another function on $SO(2)$. With these definitions, the Fourier transform on $SE(2)$ looks identical to the one in the compact case. To find the component-wise description of the transform, consider the basis $(e^{ik\\phi})_{k\\in \\mathbb Z}$ for functions on the circle (with the circle identified with $\\mathbb R/2\\pi\\mathbb Z$), and let $\\left[U^\\lambda(g)\\right]_{mn} = \\langle e^{im\\phi}, U^\\lambda(g)e^{in\\phi}\\rangle$. The $SE(2)$ Fourier transform then becomes (see Chirikjian \u0026amp; Kyatkin, Sec. 11.2) $$ \\begin{align} \\hat f_{mn}(\\lambda) = \u0026\\int_{\\mathbb R^2} \\int_0^{2\\pi} \\int_0^{2\\pi} f(\\theta, \\mathbf t) e^{in\\phi} \\nonumber \\\\\u0026e^{i \\lambda \\hspace{1pt}(\\hspace{1pt}{\\mathbf t}_x\\cos \\phi \\hspace{1pt}+\\hspace{1pt} {\\mathbf t}_y \\sin \\phi)}e^{-im(\\phi - \\theta)} \\hspace{1pt}d\\phi \\hspace{1pt}d\\theta \\hspace{1pt}d {\\mathbf t}_x d{\\mathbf t}_y. 
\\end{align} $$However, the inverse Fourier transform given above doesn\u0026rsquo;t make sense since $V^\\lambda$ is infinite-dimensional. Rather, we have\n$$ \\begin{align} f(g) \u0026= \\int_0^\\infty \\left(\\sum_{i,j\\in \\mathbb Z} \\hat f(\\lambda) _{ji}U^\\lambda(g)_{ij}\\right) \\lambda\\, d\\lambda\\\\ \u0026= \\int_0^\\infty \\mathrm{Trace}\\big(\\hat f(\\lambda)U^\\lambda(g)\\big) \\lambda\\, d\\lambda. \\end{align} $$When I first saw this, I wondered if the extra $\\lambda$ in \u0026ldquo;$\\lambda\\,d\\lambda$\u0026rdquo; was a typo!\nIn the $SE(3)$ case, the IURs act on functions on the sphere, which can be expanded in the basis of spherical harmonics. We incur a `$\\lambda^2 d\\lambda$\u0026rsquo; term in the inverse transform. Rather than repeating the details, I recommend the tutorial article GS Chirikjian, Degenerate Diffusions and Harmonic Analysis on SE(3): A Tutorial (2015) written by my postdoc advisor.\nAn extension of it does exist for the compact non-abelian case. Locally compact non-abelian groups (like $SE(n)$) are out of the question.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nSee A Course in Abstract Harmonic Analysis by Folland (Sec. 4.2) or Chirikjian and Kyatkin (Sec. 8.3.2).\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/harmonic-analysis/","summary":"The Fourier transform maps a complex-valued function to a function on its Pontryagin dual space. Generalizations of this concept to non-Abelian compact and locally compact Lie groups are reviewed.","title":"Harmonic Analysis"},{"content":"Despite having encountered the Lagrangian and Hamiltonian formalisms of mechanics several times in a variety of engineering and physics settings, I had never been able to retain it in my memory. 
I had maintained a similar dissatisfaction with the many formulae of multivariable calculus, which only really ✨clicked✨ for me once I learned about the exterior derivative and the generalized Stokes\u0026rsquo; theorem.\nIn this post, I would like to collect my thoughts on the differential geometric treatment of Lagrangian and Hamiltonian mechanics, which assume a very simple and memorable form once we introduce the language of symplectic geometry . Of course, to have made our mathematical journey to this point, where we are able to say anything at all about symplectic geometry, was a not-so-simple task.\nSymplectic Forms Let $(\\mathcal M, g)$ be a Riemannian manifold and $p\\in \\mathcal M$ a point on it. Recall that $g_p$ is an inner product on $T_p\\mathcal M$ that is also smooth as a function of $p$. If we pick coordinates on some open set of $\\mathcal M$, the Riemannian metric can be expressed in this coordinate frame using a symmetric positive definite matrix $\\mathbf G=[g_{ij}]$, such that $g= g_{ij} dx^i dx^j$.1 Since any matrix can be decomposed into a symmetric and a skew-symmetric part, one can wonder: what sort of mathematical structure might an (invertible) skew-symmetric matrix of coefficients correspond to? This line of investigation leads us to the notion of a symplectic form on $\\mathcal M$. A symplectic form is a closed, non-degenerate, skew-symmetric $2$-form $\\omega$ on $\\mathcal M$. In a coordinate frame near $p$, it can be written as $\\omega = \\omega_{ij}\\hspace{2pt} dx^i \\wedge dx^j$. The non-degeneracy condition says that the skew-symmetric matrix $\\Omega \\coloneqq [\\omega_{ij}]$ should be invertible. Interestingly, since skew-symmetric matrices are singular (i.e., rank deficient) in odd dimensions, symplectic forms only exist on even-dimensional manifolds. 
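As a quick check of this claim (a one-line sketch, where $m$ is a symbol introduced here for the dimension of the manifold, so that $\\Omega$ is an $m\\times m$ matrix): $$ \\det \\Omega = \\det \\Omega^{\\top} = \\det(-\\Omega) = (-1)^m \\det \\Omega, $$ so $\\det\\Omega = 0$ whenever $m$ is odd. 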
We will say more about symplectic forms after considering a specific type of an even-dimensional manifold, namely the cotangent bundle (i.e., the \u0026ldquo;momentum phase-space\u0026rdquo;).\nHamiltonian Mechanics on $T^\\ast\\mathcal Q$ Symplectic Structure Let $\\mathcal Q$ be a smooth $n$-dimensional manifold, which we may call the \u0026ldquo;configuration space\u0026rdquo;. The tangent bundle $T\\mathcal Q$ of $\\mathcal Q$ contains points of the form $(q, \\dot {q})$, whereas its cotangent bundle $T^\\ast\\mathcal Q$ contains points of the form $(q, p)$ such that $p(\\dot {q}) \\in \\mathbb R$. In particular, $T^\\ast\\mathcal Q$ is even-dimensional since it has dimension $2n$. Either bundle comes with a projection map, $\\pi: T^\\ast\\mathcal Q \\to \\mathcal Q$, which sends $(q, p)$ to $q$. Now, we know that smooth maps can pull back differential forms, whereas they push forward vector fields. In this sense the cotangent bundle $T^\\ast\\mathcal Q$ is different from the tangent bundle $T\\mathcal Q$; there is a very natural (i.e., not based on any arbitrary choices) construction waiting to be made on $T^\\ast\\mathcal Q$. The object we are about to conjure is called the tautological $1$-form2 on $T^\\ast\\mathcal Q$, denoted as $\\tau$. Since we eventually want a $2$-form $\\omega$ on $T^\\ast\\mathcal Q$, we will need to take the exterior derivative of $\\tau$.\nAt any point $(q, p)$ on $T^\\ast \\mathcal Q$, we use $\\pi$ to pull back the only linear form that we have available at hand, namely the form $p$ at $q$. Thus, we define $\\tau_{(q, p)} = d\\pi^\\ast_{(q, p)}p$. Note that $\\tau_{(q, p)} $ will eat vectors in $T_{(q, p)}(T^\\ast\\mathcal Q)$ while $p$ ate vectors in $T_q \\mathcal Q$, so this is a bonafide construction and not a mere renaming of objects. 
The symplectic form is defined as $\\omega = -d \\tau$; the negative sign is just by convention as far as I can tell!\nSuppose we have a coordinate chart $(q^i)_{i=1}^n$ on $U\\subseteq \\mathcal Q$, so that $q^i:U\\rightarrow \\mathbb R$. Since any differential form on $U$ can be expressed as $p = p_i dq^i$, we can think of $(q^i, p_i)_{i=1}^n$ as being a coordinate chart on the $2n$-dimensional manifold $T^\\ast\\mathcal Q$. Accordingly, $(dq^i, dp_i)_{i=1}^n$ is a basis for one-forms on $T^\\ast\\mathcal Q$. It can be shown that the tautological $1$-form is $$\\tau = p_i \\hspace{1pt}dq^i + 0\\hspace{1pt} dp_i = p_i \\hspace{1pt}dq^i$$ in these coordinates3, so that the symplectic form $\\omega$ becomes $\\omega = - d\\tau = dq^i \\wedge dp_i$. Since it is an exact form , $\\omega$ is closed: $d\\omega = -d(d\\tau)=0$. Letting $\\Omega$ denote the matrix of coefficients corresponding to this coordinate description of $\\omega$, and ordering the coordinates as $(q^1, \\ldots, q^n, p_1, \\ldots, p_n)$, we have\n\\[ \\begin{align*} \\Omega = \\begin{bmatrix} 0 \u0026 \\mathbf I \\\\ -\\mathbf I \u0026 0 \\end{bmatrix}, \\end{align*} \\] where $\\mathbf I$ is the $n\\times n$ identity matrix.\nThe important Darboux theorem says that any symplectic form can be (locally) expressed as $dq^i \\wedge dp_i$ for some choice of coordinates $(q^i,p_i)$. Note that an analogous statement does not hold true for Riemannian metrics! Only the Riemannian metrics of flat manifolds admit coordinates such that the metric tensor becomes $\\mathbf I$.\nHamiltonian Vector Fields So far we have not endowed $T^\\ast\\mathcal Q$ with a Riemannian structure, but only a symplectic structure. Recall that a Riemannian metric lets us turn functions on the manifold into vector fields by means of the gradient operation. The symplectic analog of a gradient is the Hamiltonian vector field associated with a function $H: T^\\ast\\mathcal Q \\to \\mathbb R$. Here, $H$ is called the Hamiltonian. 
While the gradient vector field describes the directions in which a function changes, the Hamiltonian vector field describes the directions in which it does not.\nThe Hamiltonian vector field $X_H$ is defined by the equation $$\\omega(X_H,{}\\cdot{}) = dH,$$ analogous to how a gradient is defined as $\\langle \\textrm{grad} f, {}\\cdot{}\\rangle = df$. Let $X_H = Z^i \\frac{\\partial}{\\partial q^i} + Y_i \\frac{\\partial}{\\partial p_i}.$ Then, the defining equation of $X_H$ is\n\\[ \\begin{align*} \\begin{bmatrix} Z^i \u0026 Y_i \\end{bmatrix} \\begin{bmatrix} 0 \u0026 \\mathbf I \\\\ -\\mathbf I \u0026 0 \\end{bmatrix} = \\begin{bmatrix} \\frac{\\partial H}{\\partial q^i} \u0026 \\frac{\\partial H}{\\partial p_i} \\end{bmatrix}, \\end{align*} \\] and by inverting $\\Omega$, we get\n\\[ \\begin{align*} \\begin{bmatrix} Z^i \u0026 Y_i \\end{bmatrix} \u0026= \\begin{bmatrix} \\frac{\\partial H}{\\partial q^i} \u0026 \\frac{\\partial H}{\\partial p_i} \\end{bmatrix} \\begin{bmatrix} 0 \u0026 -\\mathbf I \\\\ \\mathbf I \u0026 0 \\end{bmatrix}\\\\ \u0026= \\begin{bmatrix} \\frac{\\partial H}{\\partial p_i} \u0026 -\\frac{\\partial H}{\\partial q^i} \\end{bmatrix}. \\end{align*} \\] The integral curves (i.e., trajectories) of $X_H$ are given by the solutions to the following system of differential equations:\n\\[ \\begin{align*} \\frac{d q^i}{d t} = \\frac{\\partial H}{\\partial p_i}, \\quad \\frac{d p_i}{d t} = -\\frac{\\partial H}{\\partial q^i}, \\end{align*} \\] which are precisely Hamilton\u0026rsquo;s equations of motion.\nThe Theorem of Emmy Noether Notice that $\\mathcal L_{X_H} H = dH(X_H) = \\omega(X_H, X_H) = 0.$ Hence, the Hamiltonian $H$ does not change along an integral curve of its own Hamiltonian vector field $X_H$. 
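As a simple illustration (an example assumed here, not taken from the derivation above), consider the harmonic oscillator on $\\mathcal Q = \\mathbb R$ with mass $m$ and stiffness $k$, whose Hamiltonian is $H(q,p) = \\frac{p^2}{2m} + \\frac{1}{2}k q^2$. Hamilton\u0026rsquo;s equations read $$ \\dot q = \\frac{\\partial H}{\\partial p} = \\frac{p}{m}, \\qquad \\dot p = -\\frac{\\partial H}{\\partial q} = -kq, $$ and along any solution we indeed find $$ \\frac{d}{dt}H = \\frac{p}{m}\\,\\dot p + kq\\,\\dot q = \\frac{p}{m}(-kq) + kq\\,\\frac{p}{m} = 0. $$ 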
It is also true that\n$$\\mathcal L_{X_H} \\omega = (d \\circ i_{X_H})\\hspace{1pt} \\omega + (i_{X_H} \\circ d)\\hspace{1pt} \\omega =0$$where we used Cartan\u0026rsquo;s magic formula for the first equality. The second term in the middle vanishes because $\\omega$ is closed, and the first term vanishes because $i_{X_H} \\omega = \\omega(X_H,\\hspace{1pt}\\cdot\\hspace{1pt}) = dH$ is closed. When the Lie derivative of the symplectic form along a given vector field vanishes, that vector field is said to be symplectic. A Hamiltonian vector field is symplectic.\nMore broadly, we say that $f \\in C^\\infty(T^\\ast\\mathcal Q)$ is a conserved quantity of the Hamiltonian system $(T^\\ast Q, \\omega, H)$ if $\\mathcal L_{X_H} f = X_H(f) = 0$. A vector field $V$ is called an infinitesimal symmetry if both $\\omega$ and $H$ are invariant under its flow. Noether\u0026rsquo;s theorem then says that every conserved quantity $f$ is itself the Hamiltonian of an infinitesimal symmetry (i.e., $\\omega$ and $H$ are invariant along the flow of $X_f$). The converse direction is true when the first de Rham cohomology group of $\\mathcal Q$ is trivial (a remark that I will not elaborate upon today, but this person will); the converse direction says that each infinitesimal symmetry arises as the $X_f$ of some underlying conserved quantity $f$. Thus,\n$$\\text{Conserved Quantities} \\longleftrightarrow \\text{Infinitesimal Symmetries}$$are in one-to-one correspondence (when $H_{\\textrm{dR}}^1=0$). The Poincaré lemma shows that both directions of Noether\u0026rsquo;s theorem hold when $\\mathcal Q$ is (diffeomorphic to) an open ball of $\\mathbb R^n$. By the way, notice the symmetry between the roles of $H$ and $f$; we could have begun our investigation by considering $f$ to be the Hamiltonian and $H$ the conserved quantity.\nPoisson Structure $T^\\ast \\mathcal Q$ with its canonical symplectic form $\\omega$ is an example of a Poisson manifold. 
That is, there is an $\\mathbb R$-bilinear mapping from two $C^\\infty(T^\\ast \\mathcal Q)$ functions to another; this mapping is called the Poisson bracket and is defined as follows:\n$$ \\lbrace f,g\\rbrace \\coloneqq \\omega(X_f, X_g). $$It satisfies certain axioms, including versions of the Leibniz rule and Jacobi identity. Recalling that $X_g= \\frac{\\partial g}{\\partial p_i} \\frac{\\partial}{\\partial q^i} - \\frac{\\partial g}{\\partial q^i} \\frac{\\partial}{\\partial p_i}$, we have (for the Poisson bracket inherited from the canonical symplectic structure of $T^\\ast \\mathcal Q$)\n\\[ \\begin{align*} \\lbrace f, g \\rbrace \u0026= \\omega(X_f, X_g) = X_g f\\\\ \u0026= \\frac{\\partial g}{\\partial p_i} \\frac{\\partial f}{\\partial q^i} - \\frac{\\partial g}{\\partial q^i} \\frac{\\partial f}{\\partial p_i} \\end{align*} \\] which nicely captures the asymmetry between $f$ and $g$ in $\\lbrace f,g\\rbrace$. Notice that since $\\lbrace f, H\\rbrace=X_H f$, the evolution of $f$ under the flow of $X_H$ is given by the Poisson bracket too. Letting $\\alpha\\in T^\\ast Q$, \\[ \\begin{align*} \\left[\\lbrace f, H\\rbrace \\circ \\Psi_t\\right](\\alpha) \u0026= X_H f \\left(\\Psi_t(\\alpha)\\right)\\\\ \u0026=\\frac{d}{ds}f\\left(\\Psi_s(\\Psi_t(\\alpha))\\right)\\Big\\vert_{s=0}=\\frac{d}{ds}f\\left(\\Psi_s(\\alpha)\\right)\\Big\\vert_{s=t}, \\end{align*} \\] where $\\Psi_t$ is the flow of $X_H$. Thus, a conserved quantity satisfies $\\lbrace f, H\\rbrace = 0$ and vice versa (Prop. 22.21, Lee ISM).\nPreviously, we used the symplectic form to define the Hamiltonian vector field of $H$. 
However, we could consider a more general definition that only uses the Poisson structure: $X_H \\coloneqq \\lbrace \\hspace{1pt}\\cdot\\hspace{1pt}, H \\rbrace.$ Hence, in the case of a symplectic manifold, we have $$ dH(X_f)= \\omega(X_H,X_f)=\\lbrace H,f \\rbrace = -\\lbrace f,H \\rbrace =-df (X_H)=-X_H(f) $$ where the first equality uses our earlier (symplectic) definition of $X_H$, verifying that everything has been defined consistently.\nMnemonic: In both $\\omega\\left(\\hspace{1pt}\\cdot\\hspace{1pt},\\hspace{1pt}\\cdot\\hspace{1pt}\\right)$ and $\\lbrace\\hspace{1pt}\\cdot\\hspace{1pt},\\hspace{1pt}\\cdot\\hspace{1pt}\\rbrace$, the argument on the left 'gets differentiated' whereas the argument on the right differentiates (i.e., turns into a vector field). A minus sign shows up precisely when we interchange the roles of these arguments. There appears to be a confusing asymmetry in the way we define the Hamiltonian vector field; in the symplectic definition, $H$ is the object being differentiated, whereas in the Poisson algebraic definition, $H$ assumes the role of the differentiator. Thus, we see that what we really need is a Poisson structure, rather than a symplectic structure on a manifold (note that symplectic structure $\\Rightarrow$ Poisson structure, but not vice versa). A Poisson manifold is a manifold $\\mathcal M$ whose space of smooth functions, $C^\\infty(\\mathcal M)$, is endowed with the structure of a Poisson algebra.\nA Poisson algebra is a vector space (e.g., $C^\\infty(\\mathcal M)$ with pointwise addition and scalar multiplication) with two products: $(i)$ a bilinear product (e.g., pointwise multiplication of $C^\\infty(\\mathcal M)$-functions) and $(ii)$ the Poisson bracket, $\\lbrace\\cdot,\\cdot\\rbrace$, satisfying certain axioms.
In particular, we require that $C^\\infty(\\mathcal M)$ is a Lie algebra under $\\lbrace\\cdot,\\cdot\\rbrace$, and that $\\lbrace\\cdot,\\cdot\\rbrace$ satisfies a compatibility condition with the other product. Given a general Poisson bracket (i.e., one that is not defined in terms of a symplectic form), there may exist a function $f$ that makes $\\lbrace \\cdot, f\\rbrace$ vanish on all the other functions. This is not allowed in the symplectic case by virtue of the non-degeneracy of $\\omega$. An $f$ that has this property is called a Casimir function. For instance, the total angular momentum is a Casimir function for the Hamiltonian system describing rigid body dynamics. Due to the details given above, Casimir functions are invariant along all Hamiltonian systems on $T^\\ast Q$; the angular momentum must and will be conserved! Conversely, the flow of a Casimir function preserves other functions. For instance, the Hamiltonian flow of the angular momentum function is a rotation, signifying the invariance of physical phenomena to rotation of 3D space.\nIn this setting, a proof of Noether\u0026rsquo;s theorem is as follows: $\\lbrace f, H \\rbrace = X_H f = 0$ (i.e., $f$ is a conserved quantity) implies $\\lbrace H, f \\rbrace = X_f H = 0$ (i.e., $X_f$ is an infinitesimal symmetry). If the reader knows or learns more about de Rham cohomology, it will become clear why cohomology plays a role in proving one of the directions of the theorem (that is, the direction in which we start with an infinitesimal symmetry $X$).\nMomentum Maps If a Lie group $G$ acts by diffeomorphisms on $\\mathcal Q$, then its action can be lifted (canonically) into symplectomorphisms (the symplectic analog of isometries) on $T^\\ast \\mathcal Q$. 
We do this simply by considering the pullback map corresponding to the group action (see Chapter 6.3, Marsden \u0026amp; Ratiu IM\u0026amp;S).\nLet $\\Phi_g: T^\\ast \\mathcal Q \\rightarrow T^\\ast \\mathcal Q$ be the (canonically lifted) group action on $T^\\ast \\mathcal Q$ and $z = (q, p)\\in T^\\ast \\mathcal Q$ a point. We let $\\tilde \\xi$ denote the vector field on $T^\\ast \\mathcal Q$ defined by $$ \\tilde \\xi_{\\hspace{1pt}z}= \\frac{d}{dt}\\Phi_{\\exp(t\\xi)}\\left(z\\right)\\Big\\vert_{t=0}, $$ which is sometimes referred to as the infinitesimal generator of $\\Phi_{\\exp(\\xi)}$. Since $\\tilde\\xi$ is symplectic, at least locally it arises from a Hamiltonian: $\\tilde \\xi = X_{J_{\\xi}}$ for some function $J_{\\xi}$. The momentum map $\\mathbf J: T^\\ast \\mathcal Q \\rightarrow \\mathfrak g^\\ast$ is an object that lets us evaluate this Hamiltonian as a (linear) function of $\\xi$. That is,\n\\[ \\left[\\mathbf J(z)\\right](\\xi) = J_{\\xi}(z) \\] by construction. Thus, the momentum map generalizes the Hamiltonian to the case of vector fields that arise from group actions. The momentum map is not merely a Hamiltonian; rather, it encapsulates the entire family of Hamiltonians whose Hamiltonian vector fields are the infinitesimal generators. Using the definitions, we have $$ \\omega(\\tilde \\xi, \\cdot\\ )= \\omega(X_{J_\\xi}, \\cdot\\ )= dJ_\\xi. $$ For the general Poisson manifold (with or without an accompanying symplectic form), we have $$ \\tilde \\xi = X_{J_{\\xi}} = \\lbrace \\ \\cdot\\ , J_{\\xi}\\rbrace. $$ Lagrangian Mechanics on $T\\mathcal Q$ We have spent some time on the cotangent bundle, where Hamiltonian mechanics appears in a very natural way. For me, the attraction of the Lagrangian formalism is that the Lagrangian is very tangible; it comes from some sort of a variational calculus problem that we can write down. On the other hand, I have been told that $L=T - V$ is the Lagrangian. 
But where does $T - V$ come from, and how does this generalize to other (less classical) mechanical systems? This is not so clear to me.\nLagrangian mechanics is done on $T\\mathcal Q$ (i.e. the \u0026ldquo;velocity phase-space\u0026rdquo;). As the Hamiltonian is a function $H:T^\\ast \\mathcal Q \\rightarrow \\mathbb R$, the Lagrangian is a function $L:T\\mathcal Q \\rightarrow \\mathbb R$. For instance, to minimize some property of a curve we would write down the objective function as\n$$\\underset{\\gamma(\\cdot)\\hspace{1pt}\\in\\hspace{1pt}\\mathcal C}{\\textrm{minimize}}\\ \\int_{t_0}^{t_1} L\\left(\\gamma(t), \\dot \\gamma(t)\\right)\\hspace{1pt} dt$$where $\\mathcal C$ is some set of curves and the quantity $\\dot \\gamma(t)$ represents a tangent vector. Similar minimization problems can be set up to describe optimal trajectories and configurations of rigid bodies, fluids, surfaces (such as soap films and cloths), etc.4\nSymplectic Approach The tool that lets us move between $T\\mathcal Q$ and $T^\\ast\\mathcal Q$ is called the fiber derivative, which is also the same thing as the better-known Legendre transform. Since one might hope to pull the tautological and symplectic forms to $T\\mathcal Q$, it is expected that the fiber derivative should be a map from $T\\mathcal Q$ to $T^\\ast\\mathcal Q$. Indeed, the fiber derivative of $L$ is the map $\\mathbb F L: T\\mathcal Q \\rightarrow T^\\ast\\mathcal Q$ defined in Marsden and Ratiu\u0026rsquo;s book (and omitted here)5. It gives us a tautological one-form $\\tau^L$ and a $2$-form $\\omega^L$ on $T\\mathcal Q$. However, while the non-degeneracy of the symplectic form $\\omega$ is guaranteed on $T^\\ast\\mathcal Q$, the pullback form $\\omega^L$ on $T\\mathcal Q$ may be degenerate! In the case of systems for which the pullback $2$-form $\\omega^L \\coloneqq \\mathbb F L^\\ast(\\omega)$ is non-degenerate (and therefore6 a symplectic form), the Lagrangian and Hamiltonian formalisms are equivalent. 
In coordinates, we have $\\tau^L = \\frac{\\partial L}{\\partial \\dot q^i} dq^i$ and (by exterior differentiation)\n$$\\omega^L = \\frac{\\partial^2 L}{\\partial \\dot q^i \\partial q^j} dq^i \\wedge dq^j + \\frac{\\partial^2 L}{\\partial \\dot q^i \\partial \\dot q^j} dq^i \\wedge d\\dot q^j.$$ As shown in Marsden and Ratiu, there is a very straightforward condition for the non-degeneracy of $\\omega^L$: $$\\text{rank}\\left(\\frac{\\partial^2 L}{\\partial \\dot q^i \\partial \\dot q^j}\\right) = n.$$A Lagrangian vector field $Z$ is one such that $\\omega^L(Z, \\cdot) = dE$, where $E$ is the energy function on $T\\mathcal Q$ defined by $E(q, \\dot q) = \\dot q^i \\frac{\\partial L}{\\partial \\dot q^i} - L(q, \\dot q)$.7 When $\\omega^L$ is degenerate, the Lagrangian vector field $Z$ is not unique (which is perhaps why we do not write it as \u0026ldquo;$Z_{E}$\u0026rdquo;). The integral curves of $Z$ correspond to Lagrange\u0026rsquo;s equations of motion (Sec. 7.3 of Marsden and Ratiu).\nThe downside of the above approach is that it still does not tell me where the functions $L$ or $H$ come from. It just says that, given an initial piece of data (be it a Hamiltonian $H$ or a Lagrangian $L$), there is a very natural sequence of steps that lets us write down Hamilton\u0026rsquo;s or Lagrange\u0026rsquo;s equations of motion in coordinates.\nVariational Approach A connected subset of the real line $[a,b]\\subseteq \\mathbb R$ is called an interval. A curve or a path in $\\mathcal Q$ is given by a smooth map $\\gamma:[a,b]\\rightarrow \\mathcal Q$. The quantity $$\\dot \\gamma(s) \\coloneqq d\\gamma_s \\left(\\frac{\\partial}{\\partial t}\\Big\\vert_{t=s}\\right)$$ can be thought of as the \u0026ldquo;velocity vector\u0026rdquo; corresponding to $\\gamma(\\cdot)$ at time $s$, i.e., the tangent vector along the curve $\\gamma$ at $\\gamma(s) \\in \\mathcal Q$.
The space of all smooth curves is an infinite-dimensional manifold $\\mathcal C$, whose tangent space at $\\gamma$ (i.e., $T_{\\gamma}\\mathcal C$) is the space of all smooth vector fields along $\\gamma$. To see this, notice that if $\\gamma(\\hspace{1pt}\\cdot\\hspace{1pt},\\lambda)$ is a family of curves parameterized by $\\lambda$ (i.e., a curve in curve-space), then the tangent to this curve at $\\gamma(\\hspace{1pt}\\cdot\\hspace{1pt},0)$ is $\\frac{d}{d\\lambda}\\gamma(\\hspace{1pt}\\cdot\\hspace{1pt},\\lambda)\\big\\vert_{\\lambda=0}$. This object needs to be fed an argument in $[a,b]$ before it produces an element of $T\\mathcal Q$. Confusingly, many authors denote this object as $\\delta \\gamma$, referring to it as a variation of $\\gamma$.\nWe can rewrite the minimization problem mentioned earlier as follows: let $L: T\\mathcal Q \\rightarrow \\mathbb R$ be a Lagrangian. We let $\\mathfrak S$ denote the functional (i.e., a function which acts on other functions) defined by $$\\mathfrak S[\\gamma] = \\int_a^b L(\\gamma, \\dot \\gamma)\\hspace{3pt} dt$$ We want to find the curve in some appropriate submanifold of $\\mathcal C$ that minimizes $\\mathfrak S$, under the pretext that $L$ could represent things such as the length or potential energy of a $\\gamma$-shaped object. Just like in the finite-dimensional case of optimization, we can find the critical points (and hence the candidate local minima) of this problem by setting the exterior derivative $d\\mathfrak S_{\\tilde{\\gamma}}$ to $\\mathbf 0$ and solving for $\\tilde{\\gamma}$. Note that this \u0026ldquo;$\\mathbf 0$\u0026rdquo; is the zero covector in $T^\\ast_{\\tilde{\\gamma}}\\mathcal C$, as opposed to the real number $0$.\nLet $X:[a,b]\\rightarrow T\\mathcal Q$ be a vector field along $\\tilde{\\gamma}$.
Since $X$ also represents an element of $T_{\\tilde{\\gamma}}\\mathcal C$, the covector $d\\mathfrak S_{\\tilde{\\gamma}}$ can eat it to produce a number, which we set to $0$:\n\\[ [d\\mathfrak S_{\\tilde{\\gamma}}](X) = \\left[d\\left(\\int_a^b L(\\gamma, \\dot \\gamma) \\hspace{3pt} dt\\right)\\right] (X)=0. \\] Let $\\gamma(\\hspace{1pt}\\cdot\\hspace{1pt},\\lambda)$ be a representative \u0026ldquo;curve\u0026rdquo; corresponding to $X\\in T_{\\tilde{\\gamma}}\\mathcal C$, such that $\\frac{d}{d\\lambda}\\gamma(\\hspace{1pt}\\cdot\\hspace{1pt},\\lambda)\\big\\vert_{\\lambda=0} = X$ and $\\gamma(\\hspace{1pt}\\cdot\\hspace{1pt},0) = \\tilde{\\gamma}$. Then,\n\\[ \\begin{align*} [d\\mathfrak S_{\\tilde{\\gamma}}](X) = \\frac{d}{d\\lambda}\\mathfrak S[\\gamma(\\hspace{1pt}\\cdot\\hspace{1pt}, \\lambda)]\\Big\\vert_{\\lambda=0}. \\end{align*} \\] Because the differential structure of $\\mathcal Q$ does not interact whatsoever with the integral structure of $[a,b]$, we interchange the differentiation and integration as usual:\n\\[ \\begin{align*} [d\\mathfrak S_{\\tilde{\\gamma}}](X) \u0026= \\frac{d}{d\\lambda}\\int_a^b L(\\gamma(t, \\lambda), \\dot \\gamma(t,\\lambda)) \\hspace{3pt} dt\\ \\Big\\vert_{\\lambda=0}\\\\ \u0026= \\int_a^b \\frac{\\partial L}{\\partial \\gamma^i}\\frac{\\partial }{\\partial \\lambda}\\gamma^i(t,\\lambda)\\Big\\vert_{\\lambda=0}\\\\\u0026\\qquad + \\frac{\\partial L}{\\partial \\dot \\gamma^i}\\frac{\\partial}{\\partial \\lambda}\\dot \\gamma^i(t,\\lambda)\\Big\\vert_{\\lambda=0}\\hspace{3pt} dt, \\end{align*} \\] where we have made some implicit (but hopefully reasonable) identifications. In particular, we let $\\gamma^i(t,\\lambda)$ be the coordinate expression of $\\gamma(t,\\lambda)$, and write $X_{\\tilde \\gamma(t)}=X^i(t) \\frac{\\partial}{\\partial q^i}\\big\\vert_{\\tilde\\gamma(t)}$.
We have that $\\frac{\\partial }{\\partial \\lambda}\\gamma^i(t,\\lambda)\\big\\vert_{\\lambda=0}=X^i(t)$ by virtue of $ \\gamma(t,\\lambda)$ being a representative of $X$. As for the second term, we use the fact that $\\frac{\\partial}{\\partial \\lambda}$ and $\\frac{\\partial}{\\partial t}$ commute:\n\\[ \\begin{align*} [d\\mathfrak S_{\\tilde{\\gamma}}](X) \u0026= \\int_a^b \\frac{\\partial L}{\\partial \\gamma^i}X^i(t) + \\frac{\\partial L}{\\partial \\dot \\gamma^i}\\frac{\\partial}{\\partial \\lambda}\\left(\\frac{\\partial}{\\partial t} \\gamma^i\\right)\\bigg\\vert_{\\lambda=0}(t) \\hspace{3pt} dt \\nonumber \\\\ \u0026= \\int_a^b \\frac{\\partial L}{\\partial \\gamma^i}X^i(t) + \\frac{\\partial L}{\\partial \\dot \\gamma^i}\\frac{\\partial}{\\partial t}\\left(\\frac{\\partial}{\\partial \\lambda} \\gamma^i\\right)\\bigg\\vert_{\\lambda=0}(t) \\hspace{3pt} dt \\nonumber \\\\ \u0026= \\int_a^b \\frac{\\partial L}{\\partial \\gamma^i}X^i(t) + \\frac{\\partial L}{\\partial \\dot \\gamma^i}\\frac{d X^i}{d t} (t) \\hspace{3pt} dt \\end{align*} \\] and then integrate (the second term) by parts:\n\\[ \\begin{align*} [d\\mathfrak S_{\\tilde{\\gamma}}](X) \u0026= \\int_a^b \\frac{\\partial L}{\\partial \\gamma^i}X^i(t)\\hspace{2pt}dt + \\frac{\\partial L}{\\partial \\dot \\gamma^i}X^i(t)\\bigg\\rvert_a^b \\hspace{2pt} \\nonumber\\\\ \u0026\\quad - \\int_a^b \\frac{d}{dt}\\left(\\frac{\\partial L}{\\partial \\dot \\gamma^i}\\right)X^i(t) \\hspace{3pt} dt. \\end{align*} \\] If we hold the endpoints fixed, i.e., $\\gamma(a,\\hspace{1pt}\\cdot\\hspace{1pt})$ and $\\gamma(b,\\hspace{1pt}\\cdot\\hspace{1pt})$ are each constant, then $X^i(a) = X^i(b) = 0$. In this case, the condition $[d\\mathfrak S_{\\tilde{\\gamma}}]=0$ reduces to the Euler-Lagrange equations.8\nReduction We can specialize the above to the case of a dynamical system evolving on a Lie group $G$. 
Since the tangent and cotangent bundles of $G$ can be trivialized (e.g., $T^\\ast G \\cong G \\times \\mathfrak g^\\ast$), we leverage this additional structure to make the associated Hamiltonian and Lagrangian equations more \u0026ldquo;Cartesian\u0026rdquo; in flavor. This procedure, when applied to Hamiltonian mechanics on $T^\\ast G$, is called Lie-Poisson reduction; when applied to Lagrangian mechanics/variational problems on $TG$, it is called Euler-Poincaré reduction. I go into the latter a bit in a different post. As for the former, it is explained quite well by Marsden and Ratiu in Chapter 13 of \u0026ldquo;Introduction to Mechanics and Symmetry\u0026rdquo;.\nAppendix The last thing I offer in this post is an overview of the notation used in Marsden and Ratiu\u0026rsquo;s books. This is as much addressed to readers (if any) of my blog as it is to my future self \u0026ndash; I often find myself struggling to recall how any of these differential operators are defined. That M\u0026amp;R\u0026rsquo;s notation is so different from the one I\u0026rsquo;m used to (which is the notation of John M. Lee\u0026rsquo;s books) does not help.\nThe Many Derivatives of Marsden \u0026amp; Ratiu Maps between Normed Vector Spaces: Let $f:U\\subseteq \\mathbf E \\rightarrow \\mathbf F$, where $\\mathbf E$ and $\\mathbf F$ are normed vector spaces, and $ x, y\\in U$. We define $\\mathbf D f( x)$ as the operator that satisfies\n\\[ \\lim_{ y \\rightarrow x}\\frac{\\lVert f(y) - f(x) - \\mathbf D f( x) \\cdot ( y - x) \\rVert_{\\mathbf F}}{\\lVert y - x \\rVert_{\\mathbf E}} = 0. \\] This is the Fréchet derivative; the directional-limit formula given next defines the Gateaux derivative. Fréchet differentiability is the stronger notion of the two, since in its definition one is probing all directions of approach (cf. \"$y \\rightarrow x$\") at once.
Another way of writing this is that\n\\[ \\mathbf D f( x) \\cdot e = \\lim_{t\\rightarrow 0} \\frac{f( x + t e ) - f( x )}{t} \\] $\\forall e\\in \\mathbf E$. In coordinates, $\\mathbf D f $ is precisely the Jacobian of $f$, $[\\partial f^i/\\partial x^j]$.\nMaps between Manifolds: If one squints at the above expressions, it appears that $\\mathbf D f(x)$ is nothing but the differential of $f$ at $x$:\n\\[ df_{x}:T_{x}\\mathbf E \\rightarrow T_{f(x)}\\mathbf F, \\] except with the tangent spaces identified with their corresponding vector spaces, e.g., $T_{x}\\mathbf E \\cong \\mathbf E$. Unlike the object $\\mathbf D f$, the object $df$ generalizes beyond normed vector spaces; the domain and codomain of $f:\\mathcal M \\rightarrow \\mathcal N$ are allowed to be arbitrary smooth manifolds. Instead of $df_x$, M\u0026amp;R write \u0026ldquo;$T_x f$\u0026rdquo; on p. 124.\nExterior Derivative: When the codomain of $f$ is $\\mathbb R$, they write \u0026ldquo;$\\mathbf d f(x)$\u0026rdquo; instead of $df_x$ or \u0026ldquo;$T_x f$\u0026rdquo;. As much as I dislike the idea of introducing a special piece of notation for $\\mathbb R$-valued maps, I think this is a reasonable piece of notation since in this case $\\mathbf d f(x)$ is a bona fide one-form in $T_x^{\\ast}\\mathcal M$. After identifying $T_{f(x)} \\mathbb R$ with $\\mathbb R$, the operator $\\mathbf d$ is indeed the exterior derivative as the notation promises. Thank God!\nFunctional Derivative: This is yet another one of those unfortunately named (and notated) objects that in my opinion could simply be called the differential (or with some care, the gradient).
Given $f:\\mathbf E \\rightarrow \\mathbb R$, let ${\\delta f}/{\\delta x} \\coloneqq df_{x}$ be the unique element of $T_x^*\\mathbf E \\cong \\mathbf E^\\ast$ that satisfies $$ \\mathbf D f(x)\\cdot e = \\frac{\\delta f}{\\delta x} (e) $$ Again, I would have just written $df_x(e)$ for this, with $e$ interpreted as an element of $T_x \\mathbf E$ as much as a vector in $\\mathbf E$. Unfortunately, M\u0026amp;R and other physicists write \u0026ldquo;$\\delta x$\u0026rdquo; instead of $e$, which complicates things further.\nGradient: The notion of a functional derivative can be generalized: consider a (nondegenerate) pairing $\\langle\\cdot, \\cdot\\rangle:\\mathbf E \\times \\mathbf F \\rightarrow \\mathbb R$ between two different vector spaces, $\\mathbf E$ and $\\mathbf F$. For instance, the \u0026ldquo;duality pairing\u0026rdquo; between $\\mathbf E$ and $\\mathbf E^\\ast$ is given by $\\langle e, \\varphi \\rangle = \\varphi(e)$. One defines the (generalized) gradient of $f:\\mathbf E\\rightarrow \\mathbb R$ at $x\\in\\mathbf E$ as the unique element $\\textup f\\in \\mathbf F$ such that $df_x(e)=\\langle e, \\textup f\\hspace{1pt}\\rangle$. Thus, the functional derivative is a sort of \u0026ldquo;gradient\u0026rdquo; under the duality pairing, while the colloquial notion of a \u0026ldquo;gradient vector\u0026rdquo; uses the inner product pairing of $\\mathbf E$ with itself.\nGradient (of a Functional): Let $(\\mathcal M, \\mu)$ be a measure space and $\\mathbf E = C^0(\\mathcal S)$ the vector space of continuous real-valued functions on $\\mathcal S\\subseteq \\mathcal M$, where $\\mathcal S$ satisfies all of the assumptions that will be needed of it. A point $\\varphi \\in \\mathbf E$ is a function of the form $\\varphi:\\mathcal S \\rightarrow \\mathbb R$. Let $f: \\mathbf E \\rightarrow \\mathbb R$ be a functional, i.e., a meta function that eats lesser functions and spits out real numbers, so that $f(\\varphi) \\in \\mathbb R$. 
We can let $\\mathbf E = \\mathbf F$ and consider the usual $L^2$ inner product on $\\mathbf E$ when defining the gradient.\nWhen defining the functional derivative above, we used the duality pairing, which meant that the \u0026ldquo;gradient\u0026rdquo; ${\\delta f}/{\\delta \\varphi}$ was a point in $\\mathbf E^\\ast$. But here, we want to use the $L^2$ inner product, so the gradient is instead a point in $\\mathbf E$ defined by the following equality: $$ \\mathbf D f(\\varphi)\\cdot e = \\int_{\\mathcal S} \\frac{\\delta f}{\\delta \\varphi}{\\small(}s{\\small)}\\ e{\\small(}s{\\small)} \\ d\\mu{\\small(}s{\\small)} $$where we used the fact that $T_{\\varphi} \\mathbf E \\cong \\mathbf E$ to view $e$ as both, a tangent vector in $T_{\\varphi} \\mathbf E$ and a vector in $\\mathbf E$ (which is in turn a function on $\\mathcal S$).\nThus, the Euler-Lagrange equations are simply the conditions under which the gradient (i.e., functional derivative) of $f$ vanishes, as it must at critical points. However, be warned that the identifications we made above only work because the codomain of $\\varphi$ has addition and multiplication.\nRecall that the notation \u0026ldquo;$dx^i dx^j$\u0026rdquo; implicitly refers to the symmetric $2$-tensor, $\\frac{1}{2}(dx^i \\otimes dx^j + dx^j \\otimes dx^i)$.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThe word tautological refers to \u0026ldquo;obvious\u0026rdquo; or self-evident truths. The tautological one-form takes its name because of the self-reference involved in its definition, i.e., the object $p$ shows up both in the subscript as well as in the definition of $\\tau_{(q, p)}$.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThis is easily verified by feeding either side a tangent vector. Also see Mechanics and Symmetry by Marsden \u0026amp; Ratiu or Smooth Manifolds by John M. 
Lee.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nFor the optimization of surfaces embedded in $\\mathbb R^3$, the objective function looks like $\\int_{[0,1]^2} L\\left(\\gamma, \\gamma_x, \\gamma_y\\right)\\hspace{1pt} dx dy$ where $x$ and $y$ are two time-like variables that take values in the parameter space $[0,1]^2$. This suggests that the Lagrangian density is best thought of as a top-degree form on the parameter space that can be integrated. When the parameter space is one-dimensional (e.g., when a trajectory is parametrized by time or a curve by its arc length), we get something like $\\int_{0}^1 L dt$.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nMaybe something I could mention here is that $\\mathbb F L$ differentiates within fibers. Unlike the other types of derivatives we have for tensor fields on manifolds, the definition of $\\mathbb F L$ exploits the vector space structure of the fibers. Remember that in order to differentiate across fibers we need a connection .\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n$\\omega^L$ is closed because the exterior derivative commutes with pullbacks (a property known as naturality). This also means that $\\tau^L$ and $\\omega^L$ have the same relationship to each other as $\\tau$ and $\\omega$ do.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nI\u0026rsquo;m not sure what the motivation for this definition is. The most I can say is that the first term is actually $\\mathbb F L(q,\\dot q)$ acting on $\\dot q$ in a tautological manner. It is also in some sense the linearization of $L$. So, $E$ is like (the negative of) $L$ minus the linearization of $L$.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nSomething that\u0026rsquo;s missing from this article is an explanation of how Noether\u0026rsquo;s theorem figures into the framework of Lagrangian mechanics. 
I will revisit and update this post if I ever get into it.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/symplectic/","summary":"Despite having encountered the Lagrangian and Hamiltonian formalisms of mechanics several times in a variety of engineering and physics settings, I had never been able to retain it in my memory. In this post, I would like to collect my thoughts on their differential geometric treatment, which assumes a very simple and memorable form once we introduce the language of symplectic geometry.","title":"Classical Mechanics on Manifolds"},{"content":"Here, I solve Problem $5-5$ from John Lee\u0026rsquo;s book on Riemannian Manifolds, which demonstrates the non-flatness of the ($2$-)sphere. This problem is particularly interesting because it serves as the motivating example for a later chapter in the book on curvature. As a by-product, we will put to rest the concerns of flat-Earthers.\nConnections First, I go over the technical tools needed to state (let alone solve) the problem. A connection $\\nabla$ on a smooth manifold $M$ is a way of differentiating vector fields (and more generally, tensor fields) along curves in $M$:\n\\[ \\nabla : \\mathfrak X(M) \\times \\mathfrak X(M) \\to \\mathfrak X(M) \\] where $\\mathfrak X(M)$ denotes the space of smooth vector fields on $M$. The axioms/properties that must be satisfied by a connection (specifically, a Koszul connection) can be found in Lee\u0026rsquo;s book, so I will focus on the insights.\nWe write $\\nabla X: \\mathfrak X(M) \\rightarrow \\mathfrak X(M)$ to denote the total covariant derivative of $X$, defined such that $\\nabla X(Y) = \\nabla _Y X$ inserts $Y$ into the first slot of $\\nabla$. Compare this with the map $\\nabla _X:\\mathfrak X(M) \\rightarrow \\mathfrak X(M)$ which inserts $Y$ into the second slot of $\\nabla$. The former is $C^\\infty(M)$-linear in $Y$, while the latter is $\\mathbb R$-linear but not $C^\\infty(M)$-linear in $Y$. 
That is, we have $C^\\infty(M)$-linearity in the first slot:\n\\[ \\nabla X (fY) = \\nabla _{fY} X = f \\nabla X (Y) \\] but a product/Leibniz rule in the second slot:\n\\[ \\nabla _X (fY) = X(f)Y + f \\nabla _X Y \\] Since the total covariant derivative $\\nabla X$ is $C^\\infty(M)$-linear, it can be expressed as a tensor, but the connection itself is not a tensor. The point is that tensors must be $C^\\infty(M)$-linear in all their arguments.\nThe Levi-Civita Connection There is a lot of freedom in choosing a connection on a Riemannian manifold. Once chosen, a connection specifies a rule for \u0026lsquo;connecting\u0026rsquo; the nearby tangent spaces (or more generally, fibers of a vector bundle) of the manifold, thereby defining a notion of differentiation of vector fields. The fundamental theorem of Riemannian geometry says that there exists a unique connection, called the Levi-Civita connection, that satisfies the following properties:\nMetric Compatibility: $\\nabla _X \\langle Y, Z \\rangle = \\langle \\nabla _X Y, Z \\rangle + \\langle Y, \\nabla_X Z \\rangle $\nThe Torsion-Free Property/Symmetry: $ \\nabla _X Y - \\nabla _Y X = [X, Y]$\nThe first of these can be viewed as a Leibniz/product rule for the metric1. The second property is a bit more mysterious; I like to think that Property 2 is about the compatibility of the connection with the flows (and thus, the differential structure) of $M$, whereas Property 1 is about compatibility with the geometry. The following discussion will shed more light on torsion-free connections.\nComparison with the Lie Derivative ($\\mathcal L_X$) The connection defines a derivative of $C^\\infty (M)$ functions in the following sense:\n\\[ \\nabla _X f = \\mathcal L_X f = X(f) \\in C^\\infty (M) \\] which is also the same as $df(X)$, $d$ being the exterior derivative. So, $\\nabla _X$ and $\\mathcal L_X$ operate identically on functions. However, $\\nabla _X Y \\neq \\mathcal L_X Y$ (in general) when $Y \\in \\mathfrak X(M)$.
This was for me one of the biggest conceptual difficulties while self-studying Riemannian geometry, so it is worth dwelling on the point.\nRecall that\n\\[ \\mathcal L_X Y(f) = [X, Y](f) = X\\big(Y(f)\\big) - Y\\big(X(f)\\big) \\] which means that $\\mathcal L_X Y = [X, Y]$ is simply the Lie bracket of vector fields. We observe the following points of departure between $\\nabla _X Y$ and $\\mathcal L_X Y$:\nThe Lie derivative is well-defined once a smooth structure for $M$ is specified, whereas the connection is an additional structure that is imposed on $M$; in particular, endowing $M$ with geometry uniquely specifies the Levi-Civita connection. Importantly, there is no natural choice of a connection on a smooth manifold until a metric is chosen for it\nThe Lie derivative $\\mathcal L_X Y$ at $p\\in M$ is defined via local flows/diffeomorphisms of $X$ near $p$, whereas $\\nabla _X Y$ only depends on the vector $X_p \\in T_p M$\n$X \\mapsto \\mathcal L_X$ is only an $\\mathbb R$-linear map (and therefore, not a tensor), whereas $X \\mapsto \\nabla_X$ is $C^\\infty(M)$-linear2. Neither of $\\mathcal L_X(\\hspace{1pt}\\cdot\\hspace{1pt})$ and $\\nabla_X(\\hspace{1pt}\\cdot\\hspace{1pt})$ is $C^\\infty(M)$-linear\nIf $\\nabla$ is torsion-free, then the torsion-free condition can be rearranged to yield\n\\[ \\nabla_ X Y = \\big(\\nabla X + \\mathcal L_ X\\big) Y \\] The first of these is a $C^\\infty(M)$-linear transformation of $Y$, so $\\nabla _X$ can be viewed as an affine transformation, with $\\mathcal L_X$ playing the role of translation. The presence of $\\mathcal L_X$ destroys the linearity of $\\nabla _X$, until and unless $X$ and $Y$ commute, in which case $\\mathcal L_X Y = [X, Y] \\equiv 0$. 
Tangential Connection In general, the connection is given with respect to a smooth local frame $(E_i)_{i=1}^n$ on $U\\subseteq M$ by $\\nabla _{E_i} E_j = \\Gamma_{ij}^k E_k$, where $\\Gamma_{ij}^k$ are called the Christoffel symbols of the connection with respect to $(E_i)_{i=1}^n$, and Einstein summation is implied. The Christoffel symbols specify the action of $\\nabla$ on arbitrary vector fields on $U$:\n\\[ \\nabla_X Y = X(Y^i) E_i + Y^i \\nabla_X E_i = \\big(X(Y^k) + X^i Y^j \\Gamma_{ij}^k \\big)E_k \\] where $X = X^i E_i$ and $Y = Y^i E_i$ are the expansions of $X$ and $Y$ in the frame $(E_i)_{i=1}^n$. The Christoffel symbols $\\Gamma_{ij}^k$ of the Levi-Civita connection can be expressed using the second Christoffel identity (which relies on coordinates) or by using the Koszul formula (which is coordinate-free); see Corollary $5.11$ in Lee\u0026rsquo;s book (on Riemannian manifolds).\nSimilar to how the Whitney embedding theorem allows us to embed $M$ in $\\mathbb R^{2n}$, a theorem attributed to John Nash says that any Riemannian manifold can be isometrically embedded in a higher-dimensional Euclidean space. Sometimes, a more straightforward way to work with the Levi-Civita connection is to isometrically embed $M$ in a higher-dimensional Euclidean space, and then (in a sense) pull back the Euclidean connection to $M$. This is considered the extrinsic approach and comes at the risk of obscuring the intrinsic structure of $M$. Let $\\mathbb R^{\\bar n}$ be the Euclidean $\\bar n$-space having the Euclidean metric $\\bar g$.
Its Levi-Civita connection is given by\n\\[ \\overline \\nabla_X Y = X(Y^i) \\frac{\\partial}{\\partial x^i} \\] which is precisely how a naive calculus student would differentiate vector fields \u0026ndash; the Riemannian geometer should be aware that this formula only holds because the Christoffel symbols are all $0$ in this case.\nNow, let $\\pi: T_p \\mathbb R^{\\bar n} \\rightarrow T_p M$ be the orthogonal projection of tangent vectors of $\\mathbb R^{\\bar n}$ onto $T_p M$. Then, the Levi-Civita connection on $M$, denoted $\\nabla$, is given by\n\\[ \\nabla_X Y = \\pi \\left(\\overline \\nabla_{\\overline X} \\overline Y\\right) \\] where $\\overline X$ and $\\overline Y$ are smooth extensions of $X$ and $Y$ to an open set of $\\mathbb R^{\\bar n}$, respectively. This paints an extrinsic picture of the covariant derivative: it is the rate of change of $Y$ along $X$ as seen from within $M$, i.e., changes in $Y$ that are normal to $M$ are disregarded.\nWe are ready to attempt Problem $5-5$ from Lee\u0026rsquo;s book.\nThe Sphere A vector field $X$ is said to be a parallel vector field if $\\nabla X$ is identically $0$. A (local) parallel coordinate frame (see my other post for an introduction to coordinate frames) exists on (an open set of) a manifold $M$ if and only if $M$ is (locally) flat. Here, local or global flatness refers to the property of being locally or globally isometric, resp., to an open set of the Euclidean space of the same dimension.\nA $2$-dimensional manifold is locally flat if and only if a small piece (i.e., an open set) cut out from it can be laid flat on the table without stretching, tearing, or crumpling it. The surface of a cylinder and the Möbius strip (with a certain, natural choice of metric for it) are examples of locally flat surfaces; this is exactly why one can construct these surfaces from a piece of paper. The crux of Problem $5-5$ is to show that such a local parallel vector field does not exist on the sphere $S^2$.
(This can be compared to the difficulty of drawing a map of the Earth that does not introduce distortions of some sort.)\nFirst, Lee asks us to consider the following embedding of the $2$-sphere in $\\mathbb R^3$:\n\\[ \\begin{aligned} \\psi: U \u0026\\rightarrow \\mathbb R^3 \\\\ (x^1, x^2) \u0026\\mapsto \\left(\\sin x^1 \\cos x^2, \\sin x^1 \\sin x^2, \\cos x^1\\right) \\end{aligned} \\] where $U\\subseteq S^2$ is an open set on the sphere and $(x^1, x^2)$ are the usual spherical polar coordinates on the sphere; technically, $\\psi = \\hat \\psi \\circ \\varphi^{-1}$ is the pullback of an embedding $\\hat \\psi: S^2 \\rightarrow \\mathbb R^3$ under the spherical polar parameterization $\\varphi^{-1}$, where $\\varphi$ is the spherical polar chart on $U$:\n(Figure: embedding of the sphere in $\\mathbb R^3$.) In the illustration, we use a \u0026lsquo;wobbly sphere\u0026rsquo; to remind us that $S^2$ does not necessarily come endowed with geometry. Rather, by treating $\\hat \\psi$ as an isometry, we will \u0026lsquo;pull back\u0026rsquo; the geometry of $\\mathbb R^3$ onto $S^2$. To this end, consider the frame $\\big(\\frac{\\partial}{\\partial x^1}, \\frac{\\partial}{\\partial x^2}\\big)$ on $\\mathbb R^2$3. We know that $\\varphi ^{-1}$ pushes this forward, yielding the local coordinate frame $(\\varphi ^{-1}_*\\frac{\\partial}{\\partial x^1},\\varphi ^{-1}_*\\frac{\\partial}{\\partial x^2})$ on $S^2$. It can then be pushed forward to $\\mathbb R^3$, giving us\n\\[ \\Big(\\hat \\psi_*\\big(\\varphi ^{-1}_*\\frac{\\partial}{\\partial x^1}\\big), \\hat \\psi_*\\big(\\varphi ^{-1}_*\\frac{\\partial}{\\partial x^2} \\big) \\Big) = \\Big(\\psi_*\\frac{\\partial}{\\partial x^1}, \\psi_*\\frac{\\partial}{\\partial x^2} \\Big). \\] Denote the standard frame in $\\mathbb R^3$ by $(\\frac{\\partial}{\\partial y^1}, \\frac{\\partial}{\\partial y^2}, \\frac{\\partial}{\\partial y^3})$. 
We have (using the usual chain rule),\n\\[ \\begin{aligned} X \\coloneq \\psi_* \\frac{\\partial}{\\partial x^1} \u0026= \\cos x^1 \\cos x^2 \\frac{\\partial}{\\partial y^1} + \\cos x^1 \\sin x^2 \\frac{\\partial}{\\partial y^2} - \\sin x^1 \\frac{\\partial}{\\partial y^3}, \\\\ Y \\coloneq \\psi_* \\frac{\\partial}{\\partial x^2} \u0026= -\\sin x^1 \\sin x^2 \\frac{\\partial}{\\partial y^1} + \\sin x^1 \\cos x^2 \\frac{\\partial}{\\partial y^2}. \\end{aligned} \\] So, there is a good reason for distinguishing between $(x^1, x^2)$ and $(y^1, y^2, y^3)$. The second of these vector fields is $Y = -y^2 \\frac{\\partial}{\\partial y^1} + y^1 \\frac{\\partial}{\\partial y^2}$, and the first is more complicated to write in terms of $(y^1, y^2, y^3)$.\nI forgot the chain rule again! My favorite thing about differential geometry is that it does not rely on memorization of formulae as much as multivariable calculus did. Instead, most of the heavy-lifting is done by the conceptual ideas, which, once internalized, can be applied to a variety of scenarios in order to recover most of the classical formulae of calculus. Some work is also done by the clever choice of notation. Let $y^1, y^2,$ and $y^3$ be the functions from $\\mathbb R^3$ to $\\mathbb R$ that project a point to its coordinates. The pushforward vector field $\\psi_* \\frac{\\partial}{\\partial x^1}$ can eat the function $ y^1(x^1, x^2)$, giving us\n\\[ \\begin{aligned} \\left[\\psi_* \\frac{\\partial}{\\partial x^1}\\right] (y^1) \u0026= \\frac{\\partial}{\\partial x^1} (y^1 \\circ \\psi) = \\frac{\\partial}{\\partial x^1} \\left(\\sin x^1 \\cos x^2\\right)\\\\ \u0026= \\cos x^1 \\cos x^2. \\end{aligned} \\] On the other hand, any vector field in $\\mathbb R^3$ can be expressed as $X^i \\frac{\\partial}{\\partial y^i}$, so that we can let this vector field act on the function $y^1$ to get $ X^i \\frac{\\partial}{\\partial y^i}(y^1) = X^1$. In fact, $X^1(x^1, x^2)$ is precisely the function $\\cos x^1 \\cos x^2$. 
We have effectively used the chain rule, though it is barely recognizable here in its intrinsic form.\nAre $X$ and $Y$ Parallel? If they are, then we will conclude that the sphere is flat.\nObserve that $X$ is the longitudinal (i.e., vertical) vector field on the sphere, and $Y$ is the latitudinal (i.e., horizontal) one. Lee asks us to compute $\\nabla_{X} X$ and $\\nabla_{Y} X$ to assess the parallel-ness of $X$, with $\\nabla$ being the tangential connection. I tried this using the standard (Cartesian) coordinate frame of $\\mathbb R^3$, but the computations are cumbersome. For instance, we have\n\\[ \\begin{align*} \\overline \\nabla_{\\overline Y} {\\overline X} \u0026= {\\overline Y}({\\overline X}^1) \\frac{\\partial}{\\partial y^1} + {\\overline Y}({\\overline X}^2) \\frac{\\partial}{\\partial y^2} + {\\overline Y}({\\overline X}^3) \\frac{\\partial}{\\partial y^3}\\\\ \u0026= \\overline Y\\left(\\frac{y^3 y^1}{\\sqrt{{y^1}^2 + {y^2}^2}}\\right)\\frac{\\partial}{\\partial y^1} + \\overline Y\\left(\\frac{y^3 y^2}{\\sqrt{{y^1}^2 + {y^2}^2}}\\right)\\frac{\\partial}{\\partial y^2} \\\\ \u0026\\qquad + \\overline Y\\left(-\\sqrt{{y^1}^2 + {y^2}^2}\\right)\\frac{\\partial}{\\partial y^3}\\\\ \u0026= \\frac{-y^2 y^3}{\\sqrt{{y^1}^2 + {y^2}^2}}\\frac{\\partial}{\\partial y^1} + \\frac{y^1 y^3}{\\sqrt{{y^1}^2 + {y^2}^2}}\\frac{\\partial}{\\partial y^2} \\end{align*} \\] Here, I considered the map $(x^1,x^2,x^3) \\mapsto (x^3 \\sin x^1 \\cos x^2, x^3 \\sin x^1 \\sin x^2, x^3 \\cos x^1)$ in order to define extensions of $X$ and $Y$ to open sets of $\\mathbb R^3$, yielding $\\overline X$ and $\\overline Y$, which is the first step in the computation of the tangential connection. Even without projecting onto the sphere (recall the definition of a tangential connection), we see that $\\overline \\nabla_{\\overline Y} {\\overline X}$ is $0$ when $y^3=0$, meaning that $X$ is parallel along the equator of the sphere. 
The computation of $\\overline \\nabla_{\\overline X} {\\overline X}$ can also be done in this way, but it will be messy. Let\u0026rsquo;s look for a better approach.\nBefore we get there, what do you think $\\nabla _X X$ will turn out to be? Here\u0026rsquo;s a hint: the integral curves of $X$ are said to be geodesics (locally length-minimizing paths) if and only if $X$ is parallel along its integral curves, i.e., $\\nabla _X X = 0$. What are the geodesics of the sphere? You might know these as the paths that airplanes like to take when flying long distances.\nUsing a Coordinate Frame for $S^2$ The frame $\\left(\\frac{\\partial}{\\partial y^1}, \\frac{\\partial}{\\partial y^2}, \\frac{\\partial}{\\partial y^3}\\right)$ is a global coordinate frame for $\\mathbb R^3$ derived from the standard/trivial coordinates, $(y^1,y^2, y^3) \\mapsto (y^1,y^2, y^3)$. It is attractive to us because it is orthonormal (the metric tensor coefficients are $\\delta_{ij}$, making the metric reduce to a dot product), and the Christoffel symbols $\\lbrace\\Gamma_{ij}^k\\rbrace_{i,j,k=1}^3$ of the Levi-Civita connection $\\overline \\nabla$ are all identically $0$.\nLet\u0026rsquo;s construct a different coordinate frame for $\\mathbb R^3$. We consider the spherical polar parametrization of $\\mathbb R^3$, $(\\varphi, \\theta, r) \\mapsto (r \\sin \\varphi \\cos \\theta, r \\sin \\varphi \\sin \\theta, r \\cos \\varphi)$. This induces a coordinate frame $(\\partial_\\varphi, \\partial_\\theta, \\partial_r)$ with $\\partial_\\varphi\\vert_{r=1}=X$ and $\\partial_\\theta\\vert_{r=1}=Y$; the vertical line \u0026lsquo;$\\vert$\u0026rsquo; is read as \u0026lsquo;restricted to\u0026rsquo;. This frame is not defined along the $y^3$ axis, but that\u0026rsquo;s okay \u0026ndash; it suffices as a local coordinate frame.\nIt seems as if this should simplify the calculation greatly, so (by an application of the no free lunch theorem) there must be a catch. 
The catch is that the frame $(\\partial_\\varphi, \\partial_\\theta, \\partial_r)$ is not orthonormal, so we need to compute the Christoffel symbols of the Levi-Civita connection with respect to this frame.\nChapter 7 of Lee shows that a (local) orthonormal coordinate frame can exist if and only if the manifold is (locally) flat, so $X$ and $Y$ could not possibly be orthonormal (on some open set). This is yet another way of demonstrating the non-flatness of the sphere; we just happen to be pursuing the equivalent condition of the existence of a parallel frame. Let\u0026rsquo;s look at the metric tensor coefficients $g_{ij}$. We have, $g_{11}=\\langle \\partial_\\varphi, \\partial_\\varphi \\rangle$, with\n\\[ \\begin{align} \\partial_\\varphi \u0026= r \\cos \\varphi \\cos \\theta \\frac{\\partial}{\\partial y^1} + r \\cos \\varphi \\sin \\theta \\frac{\\partial}{\\partial y^2} - r \\sin \\varphi \\frac{\\partial}{\\partial y^3}\\\\ \\partial_\\theta \u0026= -r \\sin \\varphi \\sin \\theta \\frac{\\partial}{\\partial y^1} + r \\sin \\varphi \\cos \\theta \\frac{\\partial}{\\partial y^2}\\\\ \\partial_r \u0026= \\sin \\varphi \\cos \\theta \\frac{\\partial}{\\partial y^1} + \\sin \\varphi \\sin \\theta \\frac{\\partial}{\\partial y^2} + \\cos \\varphi \\frac{\\partial}{\\partial y^3} \\end{align} \\] so that $g_{11}(\\varphi, \\theta, r) = r^2$. Completing the metric tensor, we find that $g_{22}(\\varphi, \\theta, r) = r^2 \\sin^2 \\varphi$ and $g_{33}(\\varphi, \\theta, r) = 1$. 
Amazingly (but as expected), $g_{12}$, $g_{13}$, $g_{23}$, and their symmetric counterparts are all identically $0$.4\nNotice that we are done with the \"ambient\" coordinate frame of $\\mathbb R^3$; we needed it in order to pull back the Euclidean metric $\\bar {\\textrm g}$ from $\\mathbb R^3$ to $S^2$ using the inclusion map $\\widehat{\\psi}:S^2 \\rightarrow \\mathbb R^3$, which we have just done: \\[ \\big[\\widehat{\\psi}^* (\\bar {\\textrm g})\\big](\\partial_{\\varphi}, \\partial_{\\theta}) = \\bar {\\textrm g}\\big(\\widehat{\\psi}_* (\\partial_{\\varphi}), \\widehat{\\psi}_*(\\partial_{\\theta})\\big). \\] Hereafter, we focus on our new frame $(\\partial_\\varphi, \\partial_{\\theta}, \\partial_r)$ on $S^2$, which we also write as $(\\partial_1, \\partial_2, \\partial_3)$ for the purpose of Einstein summation.\nThe Christoffel symbols in a coordinate frame are given by the formula\n\\[ \\begin{align} \\Gamma_{ij}^k \u0026= \\frac{1}{2} g^{kl} \\left( \\partial_i g_{jl} + \\partial_j g_{il} - \\partial_l g_{ij} \\right) \\end{align} \\] where $g^{ij}$ are coefficients of the inverse of the metric tensor (see p. 123 of Lee). So, we have\n\\[ \\begin{align*} \\Gamma_{22}^1 \u0026= \\frac{1}{2} g^{11} \\left( \\partial_2 g_{21} + \\partial_2 g_{21} - \\partial_1 g_{22}\\right) + 0 + 0 \\\\ \u0026= -\\frac{1}{2 r^2}\\partial_{\\varphi}\\left(r^2 \\sin^2 \\varphi\\right) = -\\sin \\varphi \\,\\cos \\varphi. \\end{align*} \\] and similarly,\n\\[ \\begin{align} \\Gamma_{11}^3 = -r, \\quad \\Gamma_{12}^2=\\frac{\\cos \\varphi}{\\sin \\varphi}, \\quad \\Gamma_{13}^1 = \\frac{1}{r}. 
\\end{align} \\] giving us\n\\[ \\begin{align} \\overline \\nabla_{\\overline X} \\overline X \u0026= \\overline \\nabla_{\\partial_\\varphi} \\partial_\\varphi \\\\ \u0026= \\overline \\nabla_{\\partial_1} \\partial_1 =\\Gamma_{11}^3 \\partial_3 = -r \\partial_r \\end{align} \\] Since $\\overline \\nabla_{\\overline X} \\overline X$ is purely in the radial direction, we have that $\\nabla _X X = \\pi \\left(\\overline \\nabla_{\\overline X} \\overline X\\right) =0$. Thus, $X$ is indeed parallel along its own integral curves; the integral curves of $X$ are segments of the geodesics (i.e., great circles) passing through the poles.\nLastly, we compute $\\nabla _Y X$ in the new frame:\n\\[ \\begin{align} \\overline \\nabla_{\\overline Y} \\overline X \u0026= \\overline \\nabla_{\\partial_\\theta} \\partial_\\varphi \\\\ \u0026= \\overline \\nabla_{\\partial_2} \\partial_1 = \\Gamma_{21}^2 \\partial_2 = \\frac{\\cos \\varphi}{\\sin \\varphi} \\partial_\\theta\\\\ \\Rightarrow \\nabla_{Y} X \u0026= \\pi \\left( \\overline \\nabla_{\\overline Y} \\overline X\\right )=\\frac{\\cos \\varphi}{\\sin \\varphi} \\partial_\\theta \\end{align} \\] where we used the fact that $\\Gamma_{21}^2=\\Gamma_{12}^2$ for the Levi-Civita connection when expressed in a coordinate frame. As we deduced before, $X$ is only parallel in the horizontal/latitudinal directions if $\\cos \\varphi =0$, i.e., at the equator. In conclusion, $\\nabla X \\neq 0$.\nCould there be another parallel frame? We have only shown that $(X,Y)$ is not a parallel frame, which does not preclude the existence of another parallel coordinate frame. Lee asks us to consider the point $(y^1, y^2, y^3)=(1,0,0)$ in $\\mathbb R^3$, and show that there is no parallel coordinate frame near this point. By symmetry, the argument would extend to other points on the sphere. I omit the arguments here, but page $194$ of Lee\u0026rsquo;s book outlines the process of constructing a parallel frame near $p=(1,0,0)$. 
The key idea is that such an attempt to construct a parallel vector field would both (i) coincide with $X$, and (ii) fail to be parallel.\nSome Stronger Claims In the above, we worked with a specific isometric embedding of the sphere in $\\mathbb R^3$, restricting our investigation to a specific choice of a metric. Could another metric be prescribed for the sphere that makes it flat (or equivalently, enables us to prescribe a local orthonormal/parallel coordinate frame)? First, we belabor the point that flatness depends on the choice of metric.\nThe non-flat torus: The 2-torus $\\mathbb T^2 \\cong S^1 \\times S^1$ can be embedded in $\\mathbb R^3$ as a donut: this is not flat, because parallel transport of a vector from inside the hole (i.e., close to the axis) of the donut to the outside (i.e., far from the axis) would need to preserve its length, posing an obstacle to the construction of a parallel coordinate frame. Equivalently, one cannot wrap a piece of paper around a donut (unless the length or width of the piece of paper were vanishingly small) without introducing creases or folds in the paper.\nThe flat torus: At the same time, the 2-torus can be embedded in $\\mathbb R^4$ (analogous to how a circle can be embedded in $\\mathbb R^2$5) or realized as the quotient manifold $\\mathbb R^2/\\mathbb Z^2$. The latter two constructions would give rise to natural choices of flat metrics. We conclude that $\\mathbb T^2$ admits a flat metric, but the donut (which is $\\mathbb T^2$ with a specific choice of metric) is not flat. That being said, it is possible to isometrically embed the flat torus in $\\mathbb R^3$, albeit only with $C^1$ regularity (this follows from the Nash-Kuiper theorem); these researchers are working on visualizing such an embedding, and observe that it has a self-similar, fractal-like structure, indicating that the situation is more intricate than our naive conception of the donut.\nThe locally flat sphere: The sphere does admit a metric that is locally flat (near a given point, $p\\in \\mathbb S^2$). 
To see this, we take a cube in $\\mathbb R^3$ and smooth all of its edges and corners (to make it smooth); this smoothed cube is now a Riemannian embedding of the sphere in $\\mathbb R^3$, and its induced metric is certainly (locally) flat near some of the points. So, a more appropriate question to ask is whether the sphere can be globally flat, i.e., does it admit a metric that is flat in every neighborhood, even near the \u0026lsquo;edges\u0026rsquo; of the smoothed cube?\nThe (non-existence of a) globally flat sphere: Surprisingly, the answers to such questions come from algebraic topology, having their roots in combinatorics (i.e., counting arguments). The Euler characteristic of the sphere, combined with the Gauss-Bonnet theorem, can be used to show that the sphere does not admit a globally flat metric. Here\u0026rsquo;s another proof that relies on a theorem that may be familiar (or at least accessible) to most readers. Let $g$ be a hypothesized flat metric on the sphere $S^2$. Since $S^2$ is simply connected, there then exists a global parallel vector field $X$ that is not identically zero. By metric compatibility of the corresponding Levi-Civita connection, we have\n\\[ \\nabla_Y \\langle X,X \\rangle = \\langle \\nabla_Y X, X \\rangle + \\langle X, \\nabla_Y X \\rangle = 0 \\] This means that if $\\langle X, X\\rangle _p \\neq 0$ at some point $p$, then $\\langle X, X\\rangle$ (and therefore, $X$) does not vanish anywhere on $S^2$. The hairy ball theorem says that this is impossible \u0026ndash; a contradiction!\nA Word of Caution! The most confusing part of self-learning Riemannian geometry has been, for me, the conflation between frames and coordinate frames. I suggest that readers familiarize themselves with the definition of either object; a coordinate chart induces a frame, but not every frame is the frame derived from a coordinate chart. 
Every time you encounter a formula, statement, or theorem, take care to check whether it only holds for a coordinate frame.\nThe following examples show why this distinction is important:\nThe left-invariant vector fields $(E_i)_{i=1}^n$ of a Lie group form a global, orthonormal (after carrying out the Gram-Schmidt process) frame. Does this mean that every Lie group is globally flat? No, because the frame of left-invariant vector fields on a non-Abelian Lie group is not a coordinate frame; to see this, observe that $[E_i, E_j] \\neq 0$, whereas the Lie brackets must vanish in a coordinate frame since the partial derivatives of $\\mathbb R^n$ commute.6 It is only an orthonormal coordinate frame that is indicative of flatness.\nFor the same reason, it is seldom possible to choose an orthonormal coordinate frame. One either gives up orthonormality (choosing to introduce a metric tensor) or gives up the coordinate frame (choosing to work with a frame that is not a coordinate frame, such as $(E_i)_{i=1}^n$).\nThe identity we used to compute the Christoffel symbols in this post only holds in a coordinate frame. 
In a non-coordinate frame, we would replace it with the appropriate formula in Corollary 5.11 of Lee\u0026rsquo;s book.\nThe metric compatibility condition is equivalent to the condition that $\\nabla g \\equiv 0$, though to define what this even means, we need to understand how the connection on $TM$ induces a connection on symmetric $(0,2)$-tensors on $M$ (of which the metric tensor $g$ is an example).\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nNote that $\\mathcal L_{fX} Y = [fX,Y] = -\\mathcal L_Y (fX)$, so the Lie derivative satisfies a Leibniz/product rule in either of its arguments.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThis frame looks like a net/mesh on $\\mathbb R^2$ but gets warped when it\u0026rsquo;s pushed forward to $S^2$.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nI find it remarkable that this frame is orthogonal but not orthonormal (and moreover, not constant), and that in itself is enough to pose an obstruction to flatness.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nNote that $S^1$ (and in fact, all one-dimensional smooth manifolds) is flat independently of the choice of metric; a piece of string/yarn can always be curved into another shape of desire. Consequently, the product metric on $S^1\\times S^1$ is flat as well.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThe left-invariant vector fields do induce coordinates on a connected Lie group, called the exponential coordinates: $(x^1, \\cdots, x^n) \\mapsto \\exp \\left(\\sum_{i=1}^n x^i (E_i)_e\\right)$. However, even for the matrix exponential, the sum inside $\\exp$ splits into a product of $\\exp$s only when $(E_1, \\cdots, E_n)$ commute with each other, i.e., when their pairwise Lie brackets vanish.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/sphere/","summary":"Here, I solve Problem 5-5 from John Lee\u0026rsquo;s book on \u003cem\u003eRiemannian Manifolds\u003c/em\u003e, which demonstrates the non-flatness of the 2-sphere. 
This problem is particularly interesting because it serves as the motivating example for a later chapter in the book on curvature.","title":"The Levi-Civita Connection"},{"content":"A Lie group $G$ is a group that is also a (continuous, differentiable) topological space. An example to keep in mind is $G=\\mathbb R^n$ which is a group under vector addition and has well-defined notions of continuity and differentiation. To measure lengths and volumes (and relatedly, to define and integrate probability densities) we need to endow $G$ with additional structure so that it is not merely a manifold, but a Riemannian manifold. Luckily for us, we only need to define an inner product for the Lie algebra, after which there is a natural definition of length and volume that can be made for the entire group manifold. I say that the resulting choice of volume (called the Haar measure) is natural because it is compatible with the group structure of $G$ as well as its differential structure as a manifold. This can be compared to how the standard notion of volume for $\\mathbb R^n$, the Lebesgue measure, is compatible with vector addition; we have for a (measurable) set $A\\subseteq \\mathbb R^n$ and for every $\\mathbf x_0 \\in \\mathbb R^n$,\n\\[ \\textrm{Vol}(A) = \\textrm{Vol}(\\mathbf x_0 + A), \\] where \\[ \\mathbf x_0 + A \\coloneqq \\lbrace\\, \\mathbf x_0 + \\mathbf x \\ \\vert\\ \\mathbf x \\in A \\,\\rbrace. \\] Thus begins our journey into making sense of this compatibility in the general context of a Lie group. If you are seeking a more application-oriented approach and/or aren\u0026rsquo;t all that interested in this sort of abstraction, this book has all the formulae worked out, and my previous post introduces the idea of invariant metrics and measures on Lie groups. Note that I will be using the Einstein summation convention throughout.\nLengths and Volumes ✨ Let $M$ be an $n$-dimensional smooth manifold. 
A covariant $2$-tensor field on $M$ is a bilinear map that takes $2$ smooth vector fields as its arguments and produces a $C^\\infty(M)$ function.1 A Riemannian metric is a covariant $2$-tensor field that is symmetric:\n\\[ \\begin{align} g(\\mathbf{v}_1, \\mathbf{v}_2) \u0026= g(\\mathbf{v}_2, \\mathbf{v}_1) \\in C^\\infty(M), \\end{align} \\] and positive-definite: \\[ \\begin{align} g(\\mathbf{v}_1, \\mathbf{v}_1)(p) \u003e 0 \u0026\\quad \\Leftrightarrow \\quad \\mathbf{v}_1(p) \\neq \\mathbf 0. \\end{align} \\] where $\\mathbf{v}_1, \\mathbf{v}_2 \\in \\mathfrak X(M)$ are smooth vector fields on $M$. At some point $p\\in M$, the number $g(\\mathbf{v}_1, \\mathbf{v}_2)(p)$ is interpreted as the inner product between the tangent vectors $\\mathbf{v}_1(p)$ and $\\mathbf{v}_2(p)$, often written as $\\langle \\mathbf v_1, \\mathbf v_2 \\rangle_p \\coloneqq g(\\mathbf{v}_1, \\mathbf{v}_2)(p)$. With such a mathematical structure imposed on $M$, we call $(M,g)$ a Riemannian manifold.\nWhile a metric tensor is a symmetric covariant $2$-tensor field, a volume form $\\omega$ is an alternating covariant $n$-tensor field, also called a differential $n$-form. By combining its alternating property with its linearity, $\\omega$ can be shown to be antisymmetric in its arguments. That is, if $\\mathbf{v}_1,\\mathbf{v}_2, \\cdots, \\mathbf{v}_n \\in \\mathfrak X(M)$ are smooth vector fields, then\n$$ \\omega(\\mathbf{v}_1, \\mathbf{v}_2, \\cdots, \\mathbf{v}_n) = -\\omega(\\mathbf{v}_2, \\mathbf{v}_1, \\cdots, \\mathbf{v}_n) = \\nu \\in C^\\infty(M) $$The function $\\nu$ that is spit out by $\\omega$ (after it eats $n$ vector fields) assigns the signed volume $\\nu(p)\\in \\mathbb R$ to the parallelepiped spanned by the vectors $\\mathbf{v}_1(p), \\mathbf{v}_2(p), \\cdots, \\mathbf{v}_n(p)\\in T_pM$. Thus, $\\omega$ is sort of like a \u0026lsquo;volume meter\u0026rsquo; affixed to each point of $M$. 
It is the authority on what counts as a positive volume, what counts as a small or a large volume, and so on. Its antisymmetry can be compared with the fact that $\\int_a^b f(x) dx= -\\int_b^a f(x) dx$.\nEither of these maps can be written as a tensor in local coordinates on $U\\subseteq M$, e.g., $g=g_{ij}dx^i \\otimes dx^j$, where the Einstein summation convention is used and $g_{ij}\\in C^\\infty (U)$. Since the standard tensor notation doesn\u0026rsquo;t reflect the symmetry/antisymmetry of the tensors, one typically drops it in favor of notation that does:\n\\[ \\begin{align} g \u0026= g_{i_1 i_2}dx^{i_1} dx^{i_2}\\\\ \\omega \u0026= \\omega_{i_1 i_2 \\cdots i_n} dx^{i_1} \\wedge dx^{i_2} \\wedge \\cdots \\wedge dx^{i_n} \\end{align} \\] This helps us remember that the $dx^{i_1}$ and $dx^{i_2}$ of $g$ can be swapped without consequence, whereas swapping the $dx^{i_1}$ and $dx^{i_k}$ of $\\omega$ may or may not incur a sign-change depending on the parity of the permutation.\nHere, $dx^i dx^j = \\textrm{Sym}(dx^i \\otimes dx^j) \\coloneqq \\frac{1}{2}(dx^i \\otimes dx^j + dx^j \\otimes dx^i)$ denotes the symmetric product of tensors $-$ it is defined in a way that forces the resulting tensor to be symmetric in its arguments. Similarly, '$\\wedge$' denotes the alternating (or wedge, or exterior) product of tensors, thereby capturing the antisymmetry of a differential form. 
With this notation, observe that if $\\mathbf{v} = v^k \\frac{\\partial}{\\partial x^k}$ and $\\mathbf{w} = w^l \\frac{\\partial}{\\partial x^l}$ are vector fields on $U$, where $(\\frac{\\partial}{\\partial x^i})_i$ is the dual basis of $(dx^i)_{i}$, then\n\\[ \\begin{align} g(\\mathbf{v},\\mathbf{w}) \u0026= g_{ij} dx^i dx^j \\left(v^k \\frac{\\partial}{\\partial x^k}, w^l \\frac{\\partial}{\\partial x^l}\\right) \\\\ \u0026= g_{ij}v^k w^l \\ \\left[dx^i dx^j \\left(\\frac{\\partial}{\\partial x^k}, \\frac{\\partial}{\\partial x^l}\\right)\\right] \\\\ \u0026= g_{ij} v^k w^l \\delta^i_k \\delta^j_l \\\\ \u0026= g_{ij} v^i w^j. \\end{align} \\] Since it is a (pointwise) sum of (pointwise) products of $C^\\infty(U)$ functions, $g_{ij} v^i w^j\\in C^\\infty(U)$. The metric tensor coefficients $(g_{ij})_{i,j=1}^n$ play a role similar to that of a weighting matrix $W \\in \\mathbb R^{n \\times n}$ that is introduced when defining a non-standard inner product in $\\mathbb R^n$, as $\\langle \\mathbf{v}, \\mathbf{w}\\rangle \\coloneqq \\mathbf{v}^\\top W \\mathbf{w}$.\nFrames ✨ In the above expressions, we used a frame (a system of vector fields) that arises from a coordinate chart. However, there may arise situations where we prefer to work with a frame that is not only not induced by a coordinate chart, but also cannot be induced by a coordinate chart. The frame of left-invariant vector fields on a (non-Abelian) Lie group is a prime example of this.\nA local frame in an open set $U$ of $M$ is a set of tangent vector fields in $\\mathfrak X(U)$, enumerated as $(E_i)_{i=1}^n$, such that $(E_i(p))_{i=1}^{n}$ is a basis of $T_p M$ for all $p\\in U\\subseteq M$. A local frame is orthonormal if $g(E_i, E_j) = \\delta_{ij}$, where $\\delta_{ij}$ is the Kronecker delta considered as a constant-valued $C^\\infty(U)$ function. 
A global frame is one that is defined on all of $M$, with $U=M$.\nA frame is said to be oriented once we order it and decree that this ordering \u0026ndash; as well as even permutations of it \u0026ndash; will be considered as being 'positively oriented'. In high-school geometry, we learn that the $\\textrm x - \\textrm y - \\textrm z$ basis is positively oriented by convention. In three dimensions, cyclic permutations like \\[ \\textrm x - \\textrm y - \\textrm z \\quad \\mapsto \\quad \\textrm y - \\textrm z - \\textrm x\\] are precisely the even-parity permutations, which is why $\\textrm y - \\textrm z - \\textrm x$ is considered to be positively oriented too. The dual coframe to $(E_i)_{i=1}^n$ is the collection of cotangent (or covariant) vector fields $(\\varepsilon^i)_{i=1}^n$, such that $\\varepsilon^i(E_j) = \\delta^i_{j}$. These cotangent vector fields form a basis for differential $1$-forms. We can take their tensor products to obtain a basis for covariant $k$-tensor fields:\n\\[ \\begin{align} \\lbrace \\varepsilon^{i_1} \\otimes \\varepsilon^{i_2} \\otimes \\cdots \\otimes \\varepsilon^{i_k}\\ \\vert\\ 1\\leq i_1, i_2, \\cdots, i_k \\leq n \\rbrace \\end{align} \\] or their exterior /wedge products to obtain a basis for the space of differential $k$-forms:\n\\[ \\begin{align} \\lbrace \\varepsilon^{i_1} \\wedge \\varepsilon^{i_2} \\wedge \\cdots \\wedge \\varepsilon^{i_k}\\ \\vert\\ 1\\leq i_1 \u003c i_2 \u003c \\cdots \u003c i_k \\leq n \\rbrace \\end{align} \\] Observe that covariant $k$-tensor fields are a more general class of objects than differential $k$-forms, since the latter have the additional, special property of being alternating. This fact is reflected in the dimensions of their bases (notice the range of the indices in either case). Despite this important distinction, $k$-forms are sometimes simply called '$k$-covectors'; a more precise term would be alternating $k$-covectors. 
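As a quick sanity check on this count (a worked example of my own, not taken from the references above), take $n=3$ and $k=2$. The tensor products $\\varepsilon^{i_1} \\otimes \\varepsilon^{i_2}$ give $n^k = 9$ basis elements, whereas the wedge products with $i_1 \u003c i_2$ give only $\\binom{n}{k} = 3$ of them:\n\\[ \\varepsilon^1 \\wedge \\varepsilon^2, \\quad \\varepsilon^1 \\wedge \\varepsilon^3, \\quad \\varepsilon^2 \\wedge \\varepsilon^3. \\] In the top degree $k=n$, the same count gives $\\binom{n}{n} = 1$, i.e., the single basis element $\\varepsilon^1 \\wedge \\varepsilon^2 \\wedge \\cdots \\wedge \\varepsilon^n$, so that any differential $n$-form on $U$ is a $C^\\infty(U)$ multiple of it. 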
Orthonormal Frames In the coordinate coframe $(dx^i)_{i=1}^n$, we expressed the metric tensor as $g = g_{ij}dx^i dx^j$. Let\u0026rsquo;s now try to express it in a local coframe $(\\varepsilon^i)_{i=1}^n$ on $U \\subseteq M$ that is dual to an orthonormal one:\n\\[ \\begin{align} g = g_{ij}dx^i dx^j = \\tilde g_{ij} \\varepsilon^i \\varepsilon^j \\end{align} \\] We have then, that\n\\[ \\begin{align} \\delta_{kl} = g(E_k, E_l) \u0026= \\tilde g_{ij} {\\varepsilon}^i {\\varepsilon}^j \\left(E_k, E_l\\right) \\\\ \u0026= \\tilde g_{ij} {\\varepsilon}^i \\left(E_k\\right) {\\varepsilon}^j \\left(E_l\\right) \\\\ \u0026= \\tilde g_{ij} \\delta^i_k \\delta^j_l = \\tilde g_{kl}. \\end{align} \\] This means that the metric tensor, when expressed in a coframe dual to an orthonormal one, has the trivial representation: $g = \\delta_{ij} \\varepsilon^i \\varepsilon^j$. By writing out the summation explicitly, this takes a more familiar form:\n$$ g = (\\varepsilon^1)^2 + (\\varepsilon^2)^2 + \\cdots + (\\varepsilon^n)^2. $$Given a local orthonormal frame of vector fields $(E_i)_{i=1}^n$ whose dual coframe is $(\\varepsilon^i)_{i=1}^n$, the unique (up to choice of orientation) Riemannian volume form $\\omega_g$ is given by\n\\[ \\begin{align} \\omega_g = \\varepsilon^1 \\wedge \\varepsilon^2 \\wedge \\cdots \\wedge \\varepsilon^n \\end{align} \\] so that $\\omega_g(E_1, E_2, \\cdots, E_n) = 1$. The above statements will look identical in any of the local orthonormal frames of $M$.\nCoordinate Frames The allure of orthonormal frames is that $g$ and $\\omega_g$ can be represented quite succinctly in them. However, an orthonormal frame that arises as the coordinate frame of a chart is rare: such a frame only exists when the Riemannian manifold $(M, g)$ is locally flat. 
If we would rather work with a frame that arises from coordinates, then we must resort to computing the components of a non-flat metric tensor (one that is not simply the Kronecker delta). Also see my post on the non-flatness of the sphere .\nLet $\\frac{\\partial}{\\partial x^i}\\Big\\vert_{q}$ be the usual coordinate-wise partial derivative operators in $\\mathbb R^n$, $q \\coloneqq (q_1, q_2, \\dots, q_n)$, and $q\\in V \\subseteq \\mathbb R^n$. Given a smooth function $f\\in C^\\infty(V)$, the partial derivative operators of $\\mathbb R^n$ operate on $f$ as follows:\n$$ \\begin{align*} \\frac{\\partial}{\\partial x^2}\\bigg\\vert_{q} f = \\lim_{h\\rightarrow 0} \\frac{f(q_1, q_2+h, \\dots, q_n) - f(q_1, q_2, \\dots, q_n)}{h} \\end{align*} $$ Recalling that the job of a vector field is to map a smooth function to a real number at every point (in a smooth manner), the following object is in fact a vector field on $V$: $$\\frac{\\partial}{\\partial x^i}\\bigg\\vert_{(\\ \\cdot\\ )}:V \\rightarrow TV.$$ Thus, $\\left\\lbrace\\frac{\\partial}{\\partial x^i}\\big\\vert_{(\\ \\cdot\\ )}\\right\\rbrace_{i=1}^n$ is a set of vector fields on $V$, and may be visualized as a \u0026ldquo;fisherman\u0026rsquo;s net\u0026rdquo; spread across $V$.\nUltimately, we want a coordinate frame on a subset $U$ of $M$, rather than on $V$, which is in $\\mathbb R^n$. Let $\\varphi:U \\rightarrow V$ be a smooth chart containing some point $p$ of $M$. Its differential $(\\varphi_*)_p$ is a vector space isomorphism (i.e., an invertible linear map) between $T_p U$ and $T_q V$. The pushforward of the \u0026ldquo;partial derivative vector fields\u0026rdquo; of $V$ under $\\varphi^{-1}$ gives us a coordinate frame on $U$. That is, $\\varphi^{-1}$ maps the fisherman\u0026rsquo;s net on $V$ to one on $U$. 
By an abuse of notation, I and many others use $\\left(\\frac{\\partial}{\\partial x^i}\\big\\vert_{(\\ \\cdot\\ )}\\right)_{i=1}^n$ to refer to either frame; the subscript, the function being operated on, and/or the context will make it clear which frame is being used. This means that if $\\tilde f\\in C^\\infty (U)$, then we write\n\\[ \\begin{align} \\frac{\\partial}{\\partial x^i}\\Big\\vert_{p} \\tilde f \u0026= \\left((\\varphi_*^{-1})_{\\varphi(p)} \\frac{\\partial}{\\partial x^i}\\Big\\vert_{\\varphi(p)}\\right) \\tilde f\\\\ \u0026= \\frac{\\partial}{\\partial x^i}\\Big\\vert_{\\varphi(p)} \\tilde f \\circ \\varphi^{-1} = \\frac{\\partial}{\\partial x^i}\\Big\\vert_{\\varphi(p)} (\\varphi^{-1 *} \\tilde f). \\end{align} \\] where $\\varphi^{-1 *} \\tilde f$ is called the pullback of $\\tilde f$ under $\\varphi^{-1}$; it pulls the domain of $\\tilde f$ back to $V$.\nOrthonormal Coordinate Frames I reiterate that there are Riemannian manifolds where such a coordinate frame couldn\u0026rsquo;t possibly be orthonormal at all $p\\in U$. Theorem 13.14 of Lee\u0026rsquo;s Introduction to Smooth Manifolds says that this is only possible when $U$ is flat, i.e., $M$ is locally flat.\nPullbacks ✨ Let $f:M \\rightarrow N$ be a diffeomorphism between manifolds (though it is possible to generalize the forthcoming discussion to other kinds of smooth maps).\nRecall that tangent and cotangent vectors are dual to each other, and so are their $k^{th}$ exterior powers: alternating $k$-vector fields and differential $k$-forms. Whenever we have a morphism for an object going one way, we expect a dual morphism for the corresponding dual object going the other way. Using this intuition, we deduce that if $f_{\\ast}: TM \\rightarrow TN$ allows us to push forward vector fields, there must be a dual morphism $f^{\\ast}: T^{\\ast} N \\rightarrow T^{\\ast} M$ that allows us to pull back covector fields. 
Similarly, $f^{\\ast}: \\Omega^k(N) \\rightarrow \\Omega^k(M)$2 pulls back differential forms from $N$ to $M$ (we use the same notation for either map, $f^{\\ast}$). In particular, metrics and volume forms on $N$ can be pulled back to define metrics and volume forms on $M$.\nThe covariant tensor field thus obtained on $M$ is called the pullback of the covariant tensor field on $N$ under $f$. For example, consider $M=S^2$ to be the unit $2$-sphere and $f$ to be its usual embedding (the inclusion map) into $N=\\mathbb R^3$. Then, the pullback of the Euclidean (\u0026lsquo;dot product\u0026rsquo;) metric $\\bar g$ of $\\mathbb R^3$ under $f$, $f^{\\ast} \\bar g$, is called the round metric, and it is a bona fide Riemannian metric for $S^2$ (I compute its components in the next post). When the Euclidean metric is pulled back onto a submanifold $M\\subseteq \\mathbb R^3$ in this manner, the pullback metric is called the induced metric or the first fundamental form of $M$. More generally, we can pull back covariant tensor fields from arbitrary manifolds, as long as we have a smooth map into them.\nThe pullback of a differential form is defined so that it is compatible with the pushforward of vector fields. Specifically, if $\\omega \\in \\Omega^k(N)$ is a differential $k$-form and $\\mathbf{v}_1, \\mathbf{v}_2, \\cdots, \\mathbf{v}_k \\in \\mathfrak X(M)$, then\n\\[ \\begin{align} (f^*\\omega)(\\mathbf{v}_1, \\mathbf{v}_2, \\cdots, \\mathbf{v}_k) = \\omega(f_*\\mathbf{v}_1, f_*\\mathbf{v}_2, \\cdots, f_*\\mathbf{v}_k). \\end{align} \\] I like to read this as: $f^{\\ast}\\omega$ eats vector fields on $M$ by imitating how $\\omega$ might eat the corresponding pushforward vector fields on $N$. Thus, $f^{\\ast}\\omega$ is a differential $k$-form on $M$; the domain of $\\omega$ has been pulled back by $f^{\\ast}$.\nLie Groups ✨ The tangent space at the identity of a Lie group $G$ can be given one of infinitely many possible inner products. 
However, there is a unique way to extend this inner product to a Riemannian metric by requiring that it be compatible with the group structure of $G$ (and as a consequence, compatible with the differential structure of $G$ as a manifold). For the same reason, there is also a unique choice of volume form (or equivalently, measure) with respect to which one can define the integral. This will be called the Haar integral, and it specializes to the Lebesgue integral when $G=\\mathbb R^n$. Yet another useful property of Lie groups is that there is a way to construct global orthonormal frames for it: we choose an orthonormal basis of $T_e G$ and extend it to a set of left/right-invariant vector fields. Even among Lie groups, orthonormal coordinate frames are a rare occurrence; if $G$ is non-Abelian, then an orthonormal frame could not possibly come from a coordinate system (also see this ). Nevertheless, the fact that a global orthonormal frame exists is already quite a special property.\nLeft-Invariant Vector Fields Let $G$ be a Lie group, $e\\in G$ its identity element, and $\\mathfrak g$ its Lie algebra.3 Consider an inner product on $\\mathfrak g$, $\\langle \\cdot, \\cdot \\rangle_e$, and use the Gram-Schmidt process to construct an orthonormal basis for $\\mathfrak g$. Denote one such orthonormal basis by $(\\tilde E_i)_{i=1}^n$, where $\\tilde E_i \\in \\mathfrak g$ and $\\langle \\tilde E_i, \\tilde E_j \\rangle_e = \\delta_{ij}$. Its corresponding dual basis is denoted as $(\\tilde \\varepsilon^i)_{i=1}^n$, where $\\tilde \\varepsilon^i \\in \\mathfrak g^*$. 
We can then express $\\langle \\cdot, \\cdot \\rangle_e$ by the tensor $\\delta_{ij} \\tilde \\varepsilon^i \\tilde \\varepsilon^j$.\nLet $\\mathcal L_{g}:G\\rightarrow G$ denote the left-multiplication map, $\\mathcal L_{g}(h) = gh$, and similarly define $\\mathcal R_{g}$4; observe that these maps are diffeomorphisms from $G$ to $G$, and can therefore push and pull tensors and tensor fields from one point of $G$ to another. For instance, the orthonormal basis $(\\tilde E_i)_{i=1}^n$ can be extended to a global orthonormal frame $(E_i)_{i=1}^n$ on $G$:\n\\[ E_i(g) \\coloneqq \\left(\\mathcal L_{g}\\right)_* \\tilde E_i. \\] Such a global orthonormal frame on $G$ also serves as a \u0026ldquo;basis\u0026rdquo; of the space (or more rigorously, a generating set of the $C^\\infty(G)$-module) of vector fields on $G$, since any vector field $V\\in\\mathfrak X(G)$ can be uniquely expressed as $V = v^i E_i$ with $v^i \\in C^\\infty(G)$. Vector fields of the form $\\text{c}^i E_i$ (where $\\text{c}^i$ are constants) are precisely the left-invariant vector fields of $G$.\nOne can similarly extend $(\\tilde \\varepsilon^i)_{i=1}^n$ to a left-invariant global coframe $(\\varepsilon^i)_{i=1}^n$ on $G$:\n\\[ \\varepsilon^i(g) \\coloneqq \\left(\\mathcal L_{g^{-1}}\\right)^* \\tilde \\varepsilon^i. \\] Immediately, we have the following property at all $g\\in G$:\n\\[ \\begin{align} \\varepsilon^i(E_j)(g) \u0026= \\Big[\\left(\\mathcal L_{g^{-1}}\\right)^* \\tilde \\varepsilon^i\\Big]\\Big(\\left(\\mathcal L_{g}\\right)_* \\tilde E_j \\Big) \\\\ \u0026= \\tilde \\varepsilon^i\\Big(\\left(\\mathcal L_{g^{-1}}\\right)_*\\left(\\mathcal L_{g}\\right)_* \\tilde E_j \\Big) =\\tilde \\varepsilon^i\\Big(\\tilde E_j\\Big) = \\delta_{ij}. \\end{align} \\] In the following, we assume that $(E_i)_{i=1}^n$ and $(\\varepsilon^i)_{i=1}^n$ are left-invariant. Analogous arguments follow for the right-invariant case. 
The only caveat is that the left-invariant and right-invariant metrics and volume forms may or may not turn out to be the same, as discussed in my previous post.\nWhat is Left-Invariance? Let\u0026rsquo;s scrutinize the left-invariance of $E_i$. Pretend that the left-multiplication map $\\mathcal L_g$ sends $G$ to another copy of itself, denoted as $G^💧$!\nIf we view $E_i$ as a vector field on $G$, its pushforward to $G^💧$, $(\\mathcal L_{g})_*E_i$, should act on a function $f\\in C^\\infty (G^💧)$ by mimicking whatever $E_i$ would have done in its place. Given some point $h\\in G$, with $\\mathcal L_g(h) = gh \\in G^💧$, we have\n\\[ \\begin{align} \\big[(\\mathcal L_{g})_*E_i \\,f\\big](gh) \u0026= \\big[E_i(f \\circ \\mathcal L_g)\\big](h) \\end{align} \\] Moreover,\n\\[ \\begin{align} \\big[E_i(f \\circ \\mathcal L_g)\\big](h) \u0026= \\frac{d}{dt}[f\\circ \\mathcal L_g]\\big(h \\exp(t\\tilde E_i)\\big)\\Big\\vert_{t=0}\\\\ \u0026= \\frac{d}{dt}f\\big(g h\\exp(t\\tilde E_i)\\big)\\Big\\vert_{t=0}\\\\ \u0026= [E_i f](gh)\\\\ \u0026= [(E_i f)\\circ \\mathcal L_g](h). \\end{align} \\] Now let $G=G^💧$ (as we have done implicitly in the calculation above). Observe that the calculation above involved the following maps:\n\\[ \\begin{align} f: G \u0026\\rightarrow \\mathbb R\\\\ E_i f : G \u0026\\rightarrow \\mathbb R\\\\ \\end{align} \\] i.e., $f$ and its derivative. Then, we notice that we can perform either of these maps on $G^💧$ as well. That is, we do $\\mathcal L_g: G \\rightarrow G$ first, and then perform either of the above maps. 
This gives us two more maps:\n\\[ \\begin{align} f\\circ \\mathcal L_g: G \u0026\\rightarrow \\mathbb R\\\\ (E_i f)\\circ \\mathcal L_g : G \u0026\\rightarrow \\mathbb R\\\\ \\end{align} \\] Finally, we note that $E_i$ can act on the function $f\\circ \\mathcal L_g$, giving us yet another function\n$$E_i(f\\circ \\mathcal L_g): G \\rightarrow \\mathbb R.$$ What we showed is that $E_i(f\\circ \\mathcal L_g)$ and $(E_i f)\\circ \\mathcal L_g$ are the same function, which is not true unless $E_i$ is left-invariant. The fact that $\\mathcal L_g$ moves in and out of the differentiation is what \u0026ldquo;left-invariant\u0026rdquo; refers to (also see the commutative square here). An analogous property is exhibited by $\\varepsilon^i$.\nLeft-Invariance of Geometric Structure Now consider what should happen if we define the Riemannian metric of $G$ as\n\\[ \\langle \\hspace{1pt}\\cdot\\hspace{2pt},\\hspace{1pt}\\cdot\\hspace{2pt}\\rangle = \\textrm{k}_{ij}\\hspace{1pt}\\varepsilon^i \\varepsilon^j \\] where $\\textrm{k}_{ij}$ are constants that should be thought of as the entries of a symmetric, positive-definite \u0026ldquo;weighting matrix\u0026rdquo;. Clearly, this metric should inherit the left-invariance properties of $(\\varepsilon^i)_{i=1}^n$. 
Indeed, we can use similar arguments as before to show that if $\\mathbf v, \\mathbf w \\in T_gG$, then\n$$ \\langle \\mathbf v,\\mathbf w\\rangle_g = \\langle (\\mathcal L_h)_{\\ast_g}\\mathbf v, (\\mathcal L_h) _{\\ast_g}\\mathbf w\\rangle _{hg} $$To see this, we evaluate the right hand side:\n\\[ \\begin{align*} \\langle (\\mathcal L_h)_{\\ast_g}\\mathbf v, (\\mathcal L_h) _{\\ast_g}\\mathbf w\\rangle _{hg} \u0026= \\textrm{k}_{ij}\\hspace{1pt}\\varepsilon^i_{hg} \\varepsilon^j_{hg} \\big((\\mathcal L_h)_{\\ast_g}\\mathbf v, (\\mathcal L_h) _{\\ast_g}\\mathbf w\\big)\\\\ \u0026= \\textrm{k}_{ij}\\hspace{1pt}\\varepsilon^i_{hg} \\big((\\mathcal L_h)_{\\ast_g}\\mathbf v\\big) \\varepsilon^j_{hg} \\big((\\mathcal L_h) _{\\ast_g}\\mathbf w\\big)\\\\ \\end{align*} \\] Then, use the duality between pushforwards and pullbacks to show that\n\\[ \\begin{align*} \\varepsilon^i_{hg} \\big((\\mathcal L_h)_{\\ast_g}\\mathbf v\\big) \u0026= \\big[(\\mathcal L_h)^\\ast_{_{hg}}\\varepsilon^i_{hg}\\big] \\big(\\mathbf v\\big) \\\\ \u0026=\\big[(\\mathcal L_h)^\\ast_{_{hg}} (\\mathcal L_{h^{-1}})^\\ast_{_g} \\varepsilon^i_{g}\\big] \\big(\\mathbf v\\big) \\\\ \u0026= \\varepsilon^i_g \\big(\\mathbf v\\big). \\end{align*} \\] The notation may seem cumbersome, but given how light-yet-powerful the notation of differential geometry is already, it\u0026rsquo;s not too bad (depending on what you intend to do with it). Drawing a diagram involving the points $g$ and $hg$, as well as the maps $\\mathcal L_g$ and $\\mathcal L_h$, can help in understanding the above calculation.\nA left-invariant volume form can be defined as $\\omega = \\varepsilon^1 \\wedge \\varepsilon^2 \\wedge \\cdots \\wedge \\varepsilon^n$, and has analogous invariance properties.\nNote that we could also have chosen to work with a coordinate coframe on an open set $U\\subseteq G$ containing $g$ in order to express $\\langle \\cdot, \\cdot \\rangle$. 
In this case, we would need to compute the metric tensor coefficients since they will no longer be trivial. Evidently, if we work with coordinates on a Lie group, then we are not taking advantage of the group structure. In the next post , I compute the metric tensor coefficients for the sphere (which is not a Lie group!) in spherical polar coordinates. Some important comments about the Levi-Civita connection are made as well.\nFormally, such an object is an element of $\\Gamma (T^*M \\otimes T^*M)$, i.e., it is \u0026lsquo;a smooth section of the 2nd tensor power of the cotangent bundle of $M$\u0026rsquo;. There is a sense in which covariant $k$-tensor fields are elements of the dual space corresponding to the module of contravariant $k$-tensor fields on $M$, where instead of a field of scalars, we have a ring of $C^\\infty(M)$ functions. With this linear algebraic perspective, we recognize that a vector and its dual should combine to give a scalar.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThe space of differential $k$-forms on $M$ is denoted by $\\Omega^k(M)$, which is also the space $\\Gamma (\\Lambda^k T^* M)$ of smooth sections of the $k^{th}$ exterior power of the cotangent bundle, $T^{\\ast}M$. Note that $\\Omega^k(M)$ is a subspace (specifically, a submodule) of the space of all the covariant $k$-tensor fields on $M$ (viewed as a $C^\\infty(M)$-module).\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nWe conflate $\\mathfrak g$ with $T_e G$ for convenience. (the latter does not come with a Lie bracket).\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nBeware: In the first half of this post, $g$ denoted the Riemannian metric, whereas in the latter half, it represents an arbitrary element of $G$.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/lie-groups_riemannian/","summary":"A \u003cspan class=accented\u003eLie group\u003c/span\u003e is a group that is also a (continuous, differentiable) topological space. 
To measure lengths and volumes (and relatedly, to define and integrate probability densities) we need to endow the group with additional structure so that it is not merely a manifold, but a \u003cspan class=accented\u003eRiemannian manifold\u003c/span\u003e.","title":"Lie Groups as Riemannian Manifolds"},{"content":"There are multiple ways to construct new groups from old ones. For instance, the semidirect product $SO(3) \\ltimes \\mathbb R^3$ is the Special Euclidean group $SE(3)$, which is composed of all the rigid transformations of $\\mathbb R^3$ (minus reflections). Here, I provide an intuition for how these constructions work. I will also go over some of the additional structures that can be imposed on Lie groups, paving the path towards doing differential geometry and calculus on Lie groups. While the word geometry implies the presence of an inner product/Riemannian metric, the word calculus hints at the possibility of differentiation and integration on Lie groups.\nQuotients Coset Spaces Let $G$ be a Lie group having the operation $\\odot$, and $H\\leq G$ a subgroup of it. The left cosets of $H$ in $G$ refer to the subsets of $G$ of the form\n\\[ \\begin{align} g\\odot H = \\lbrace g \\odot h\\ \\vert\\ h\\in H \\rbrace \\subseteq G \\end{align} \\] where $g\\in G$. An important property of these subsets is that they\u0026rsquo;re disjoint, in the sense that\n\\[ \\begin{align} \\textrm{either }\\ \u0026\\left( g_1 \\odot H \\right) \\cap \\left( g_2 \\odot H \\right) = \\emptyset \\nonumber \\\\ \u0026\\text{or } \\ \\ g_1 \\odot H = g_2 \\odot H. \\end{align} \\] In general, the objects $\\left\\lbrace g\\odot H\\ \\vert\\ g\\in G\\right\\rbrace$ do not form a group; they are merely a topological space called a coset space, written as $G/H$. I find this interpretation of a coset space very unmotivated. 
To me, a better way to think about $G/H$ is as the topological quotient space, $G/\\sim_H$, constructed as follows.\nLet $S$ be some topological space and $\\sim$ an equivalence relation on the elements of $S$. The (topological) quotient space $S/\\sim$ is the set of equivalence classes of $S$ under $\\sim$. In other words, $S/\\sim$ is the space obtained by gluing together all the points in $S$ that are equivalent to each other:\n\\[ \\begin{align} [s]_\\sim \u0026= \\lbrace s' \\in S\\ \\vert\\ s' \\sim s \\rbrace \\end{align} \\] $S/\\sim$ is the set of all such (distinct) equivalence classes. We say that $s_0\\in S$ is a representative of the equivalence class $[s_0]_\\sim$. It\u0026rsquo;s easy to see that the equivalence classes satisfy a property similar to (2). In particular,\n\\[ \\begin{align} s_1,s_2\\in [s]_\\sim \u0026\\iff [s_1]_\\sim=[s_2]_\\sim \\\\ \u0026\\iff s_1 \\sim s_2 \\end{align} \\] A corresponding \u0026lsquo;quotient topology\u0026rsquo; can be inherited by $S/\\sim$ from $S$ (that\u0026rsquo;s what makes this a topological quotient), but we need not get into the details of that right now.\nA special type of equivalence relation, $\\sim_H$, is one that is given by the action of some group $H$ on $S$. Let $H$ act on $S$ as $h(s)\\in S$, then $[s]_{\\sim_H} $ is defined as the orbit of $s$ under $H$:\n\\[ \\begin{align} s_1, s_2 \\in [s]_{\\sim_H} \u0026\\iff \\exists h\\in H:\\ s_1 = h(s_2) \\end{align} \\] $S/\\sim_H$ is called an orbit space.\nIf we replace $S$ by a Lie group $G$, we get (at least in terms of the notation) the concept of a coset space, $G/\\sim_H$. As mentioned earlier, $G/\\sim_H$ need not constitute a group (i.e., we will not necessarily define a group operation between the equivalence classes), so it should suffice to think about it topologically. Let $H$ act on the Lie group $G$ by right multiplication, i.e., $h(g) = g\\odot h$. 
Then, the orbit space $G/\\sim_H$ is the set of all left cosets of $H$ in $G$, which is precisely the left coset space, $G/H$. Thus,\n\\[ \\begin{align} g_1, g_2 \\in [g]_{\\sim_H} \u0026\\iff \\exists h\\in H:\\ g_1 = g_2 \\odot h \\end{align} \\] or more concisely,\n\\[ \\begin{align} g_1, g_2 \\in [g]_{\\sim_H} \u0026\\iff g_2^{-1} \\odot g_1 \\in H. \\end{align} \\] It can be shown that this characterization of a coset space works just as well as the one in (1) and (2).\nHomogeneous Spaces While $G/H$ need not have a group operation associated with it, it is possible to let $G$ act on $G/H$ by left multiplication, i.e., $g\\odot [g^\\prime]_{\\sim_H} = [g\\odot g^\\prime]_{\\sim_H}$. Since this action can be shown to be transitive, $G/H$ is a homogeneous space. The stabilizer of $[e]_{\\sim_H}\\in G/H$ is $H$, which is to say that $H$ is the subgroup of elements of $G$ that act trivially (by the identity map) on $[e]_{\\sim_H}$.\nConversely, given a homogeneous space $X$ on which a group $G$ acts, we can construct a coset space by choosing a distinguished point $x\\in X$. Letting $H$ denote the stabilizer of $x$ in $G$, $X$ can be shown to be isomorphic to $G/H$. Thus, a homogeneous space is obtained after we (i) equip a coset space of a Lie group with the appropriate action, and (ii) \u0026ldquo;forget\u0026rdquo; the identity coset $[e]_{\\sim_H}$ of $G/H$. The last point is similar to how an affine space is obtained after we forget the origin of $\\mathbb R^n$. Another way of stating the above is that there is no distinguished point in a homogeneous space, so the identity coset should be treated the same as any other.\nQuotient Groups The quotient of topological spaces is distinct from the quotient of groups, for the same reason that topological homomorphisms (i.e., continuous maps) are different from Lie group homomorphisms. 
While $G/H$ is always a topological quotient space (in the sense of $G/\\sim_H$), it is a quotient group when $H$ is a normal subgroup of $G$ \u0026ndash; written as $H \\trianglelefteq G$. In other words, the fact that $H$ happens to be a normal subgroup enables us to come up with a well-defined group operation on $G/H$.\nBy the defining property of a normal subgroup, we have that $g\\odot H \\odot g^{-1}= H$ for all $g\\in G$ (i.e., $H$ is invariant under conjugation). The group operation $\\star$ of $G/H$ is then given by\n\\[ \\begin{align} \\left[ g_1 \\odot H \\right] \\star \\left[ g_2 \\odot H \\right] = \\left[ (g_1 \\odot g_2) \\odot H\\right]. \\end{align} \\] The fact that $H$ is normal makes this definition independent of the choice of representatives, $g_1$ and $g_2$.\nAn example of a quotient group is $\\mathbb R^2/\\mathbb Z^2$, where $\\mathbb Z^2$ is the integer lattice in $\\mathbb R^2$ considered as a group under addition. Topologically, $\\mathbb R^2/\\mathbb Z^2$ is a torus, and can be used to model for instance the repeating pattern of a crystal structure or a tessellation. By contrast, $SO(3)/SO(2)$ is only a coset space and not a quotient group, since $SO(2)$ is not a normal subgroup of $SO(3)$; topologically, it is a $2$-sphere (i.e., the surface of a ball in $\\mathbb R^3$). Notice that the subgroup of rotations that stabilize a given point on the sphere is isomorphic to $SO(2)$.\nProducts Direct Products The direct product of two groups $(H, \\overset{H}{\\odot})$ and $(K, \\overset{K}{\\odot})$ is the set of ordered pairs $H \\times K$ with the group operation\n$$(h_1, k_1) \\overset{H\\times K}{\\odot} (h_2, k_2) = (h_1 \\overset{H}{\\odot} h_2, k_1 \\overset{K}{\\odot} k_2).$$This can be an uninteresting construction, however, since it doesn\u0026rsquo;t intertwine the group operations of $H$ and $K$ in any meaningful way. The group $\\mathbb R^2 = \\mathbb R \\times \\mathbb R$ is the direct product of two copies of $\\mathbb R$ (where each group operation is assumed to be addition). 
The group $SO(2) \\times SO(2) \\cong \\mathbb R^2/\\mathbb Z^2$ may be used to describe (topologically) the configuration space of a robotic arm with a hinge. In the case of matrix Lie groups, a direct product is represented by block diagonal matrices containing a block from each constituent group.\nInner Semi-Direct Products The inner semi-direct product is a way of expressing a group $G$ as the product of two of its subgroups. Let\n\\[ N,H \\leq G, \\quad N \\trianglelefteq G \\] Moreover, we require that $NH=G$ and $H \\cap N = \\lbrace e \\rbrace$, i.e., the subgroups \u0026lsquo;span\u0026rsquo; $G$ and are (as sets) complements of each other in $G$. If the preceding statements hold, then $G$ is the inner semi-direct product of $H$ acting on $N$, written as $G = N \\rtimes H$ (the triangle helps us remember that $N$ is normal). The group operation $\\star$ is given by\n\\[ \\begin{align} (n_1, h_1) \\star (n_2, h_2) = \\big(n_1 ( h_1 n_2 h_1^{-1} ),h_1 h_2\\big) \\end{align} \\] where $h_1 h_2$ is a short-hand for $h_1 \\odot h_2$, and $\\star$ is the newly-defined group operation.\nThis video does a commendable job of explaining the motivation behind this definition \u0026ndash; the gist of it is as follows. Recall that $NH=G$. It can be shown with some effort that the map $\\phi: N \\times H \\rightarrow G$ defined by $\\phi(n,h) = n h$ is a bijection of sets. This motivates us to go a little further and see if we can turn $\\phi$ into a group isomorphism. 
Now, observe that\n\\[ \\begin{align} \\phi\\big((n_1, h_1) \\star (n_2, h_2)\\big) \u0026= \\phi\\big(n_1 ( h_1 n_2 h_1^{-1} ),h_1 h_2\\big) \\nonumber \\\\ \u0026= n_1 ( h_1 n_2 h_1^{-1} ) h_1 h_2 \\nonumber\\\\ \u0026= n_1 h_1 n_2 h_2 \\nonumber\\\\ \u0026= \\phi(n_1, h_1) \\phi(n_2, h_2) \\end{align} \\] This shows that $\\phi$, together with the operation $\\star$ defined as above, is actually an isomorphism of groups (i.e., it preserves the group operation, and is bijective as a map between the underlying sets). So, it is enough to remember that $\\star$ should be defined in a way that the $h_1^{-1}h_1$ in \u0026lsquo;$n_1 ( h_1 n_2 h_1^{-1} ) h_1 h_2$\u0026rsquo; should cancel out. Lastly, note that $N$ needs to be a normal subgroup so that $ h_1 n_2 h_1^{-1}$ is indeed in $N$.\nThe direct product of sets is not the same as the direct product of groups. In the definition of $\\phi$, we used '$N \\times H$' to denote the set-theoretic product of $N$ and $H$. We did not prescribe a group operation on '$N \\times H$', and we certainly did not impose the direct-product group operation on it. To reiterate, $N \\rtimes H$ is the group given by the set of elements '$N \\times H$' along with the group operation $\\star$ as defined above. Outer Semi-Direct Products The outer semi-direct product follows by observing that the preceding definition can be generalized to the case where $H$ and $N$ are not subgroups of some common group. Here, the group $G$ is not given to us to begin with \u0026ndash; it is obtained as a result of performing the semi-direct product construction.\nTo define an outer semi-direct product, we require an action $\\Phi_h(\\cdot):N \\rightarrow N$ for each element $h\\in H$ such that $\\Phi_h(\\cdot)$ is a group automorphism of $N$. 
For instance, observe that if $H,N \\leq G$ and $N\\trianglelefteq G$, then the choice of the action $\\Phi_h(n) \\coloneqq h n h^{-1}$ will make the outer semi-direct product coincide with the inner semi-direct product (i.e., the latter is a special case of the former). In the more general setting where $H$ and $N$ are only related to each other via actions of the form $\\Phi_h(n)$, the group operation $\\star$ is given by\n\\[ \\begin{align} (n_1, h_1) \\star (n_2, h_2) = \\big(n_1 \\Phi_{h_1}(n_2),h_1 h_2\\big) \\end{align} \\] Letting $G = N \\rtimes H$ as above, we have that $H \\cong G/N$ (also see group extension below). We say that $\\Phi_h(\\cdot)$ twists the group multiplication; a trivial twist of $\\Phi_h(n)=n$ reduces the semi-direct product to a direct product.\nExample 1: $SE(3) \\cong \\mathbb R^3 \\rtimes SO(3)$, where the action of $SO(3)$ on $\\mathbb R^3$ is given by $\\Phi_R(p) = Rp$. Thus,\n\\[ \\begin{align} (p_1, R_1) \\star (p_2, R_2) = \\big(p_1 + R_1p_2,R_1 R_2\\big) \\end{align} \\] where the $+$ comes from the group operation of $\\mathbb R^3$, the matrix-vector multiplication comes from $\\Phi_R$, and the matrix-matrix multiplication comes from $SO(3)$. Remember that the group that the triangle points towards (i.e., $\\mathbb R^3$) plays the role of the normal subgroup, whereas the other group ($SO(3)$) acts on the former via automorphisms.\nExample 2: $GL(n;\\mathbb R) \\cong SL(n;\\mathbb R) \\rtimes \\mathbb R^\\times$\nIt\u0026rsquo;s not too hard to see that these groups should be related to each other as above. The elements of $SL(n;\\mathbb R)$ are determinant-one matrices, and the group $\\mathbb R^\\times$ scales the determinant up and down to recover the remaining matrices of $GL(n;\\mathbb R)$. However, why is this not the same as $SL(n;\\mathbb R) \\times \\mathbb R^\\times$? 
We should try to answer this question, so that the distinction between direct and semidirect products becomes clear to us.\nAny direct product should have a projection onto either of its factors, similar to how one is able to extract the $\\textrm{x}$ and $\\textrm{y}$ coordinates of a vector in $\\mathbb R^2$. When $n$ is odd, we have the following projection from $GL(n;\\mathbb R)$ to $SL(n;\\mathbb R)$:\n\\[ (A) \\mapsto \\frac{A}{\\big[\\textrm{det}(A)\\big]^{1/n}} \\] since\n$$ \\textrm{det}\\left(\\frac{A}{\\big[\\textrm{det}(A)\\big]^{1/n}}\\right) = \\frac{\\textrm{det}(A)}{[\\textrm{det}(A)]^{n/n}} =1. $$This would not work if $n$ were even, however, since $\\big[\\textrm{det}(A)\\big]^{1/n}$ may not have a real value in these cases1. On the other hand, the semi-direct product does not require that there be a projection from $GL(n;\\mathbb R)$ to $SL(n;\\mathbb R)$ to begin with.\nSo, $GL(n;\\mathbb R) \\cong SL(n;\\mathbb R) \\rtimes \\mathbb R^\\times$ always, and $GL(n;\\mathbb R) \\cong SL(n;\\mathbb R) \\times \\mathbb R^\\times$ when $n$ is odd, which is the content of Problem 7-21 of Lee\u0026rsquo;s book (also see the errata ). We will come back to this example in the context of group extensions, below.\nExample 3: $O(n) \\cong SO(n) \\rtimes O(1)$, which is also an example from Lee\u0026rsquo;s book. Note that $O(1)$ is simply a group of the form $\\lbrace 1, -1 \\rbrace$. As before, we can also express this as a direct product when $n$ is odd.\nGroup Extensions Given a homomorphism $A\\rightarrow B$ between objects $A$ and $B$ of some type, the kernel of the homomorphism is the set of elements of $A$ that map to the identity of $B$. For example, the kernel of (the linear transformation corresponding to) a matrix is its null space. 
It can be shown quite easily that a group homomorphism $f:G\\rightarrow G^\\prime$ must map the identity of $G$ to the identity of $G^\\prime$, so $\\ker(f)$ will at least contain the identity element of $G$.\nThe following sequence of group homomorphisms is an example of a short exact sequence, in which the image of each homomorphism happens to be the kernel of the next homomorphism:\n\\[ 1 \\rightarrow N \\overset{\\iota}{\\rightarrow} G \\overset{\\pi}{\\rightarrow} H \\rightarrow 1 \\] Here, $G$ is said to be an extension of $H$ by $N$, and $1=\\lbrace e\\rbrace$ is the trivial one-element group. It\u0026rsquo;s quite remarkable how much this succinct expression says about the relationships between $N$, $G$, and $H$. Firstly, $1$ maps to the identity of $N$ (since group homomorphisms preserve identities), so exactness at $N$ forces $\\iota$ to be injective; on the right end of the sequence, all of $H$ maps to $1$, so exactness at $H$ forces $\\pi$ to be surjective. We know that $\\iota{(N)}$ is a normal subgroup of $G$ (it is the kernel of $\\pi$) by the first isomorphism theorem of Noether. The first isomorphism theorem also says that the quotient $G/\\iota{(N)}$ is isomorphic to $H$.\nIn the typical mathematical notation, $\\iota$ denotes an inclusion and $\\pi$ a projection. An inclusion maps a subset of a set injectively (though in all likelihood, not surjectively) into the parent set. A similar definition follows for the inclusion of a subspace into a vector space, a topological subspace into its ambient space, and so on. A projection is something we\u0026rsquo;ve seen before in the context of a fiber bundle. This suggests that $G\\cong N\\times H$ will always satisfy such a short exact sequence, with $\\pi$ defined quite literally as a projection.\nThen arises the question of when $G$, if given as a group extension, is the semi-direct product of $H$ acting on $N$. Since $G$ need not have a projection onto $N$ in a semi-direct product, we certainly don\u0026rsquo;t need to have a morphism $N \\leftarrow G$. 
Instead, what characterizes a semi-direct product precisely is the existence of a group homomorphism $\\sigma: H \\rightarrow G$, called a section or a splitting , such that $\\pi \\circ \\sigma=\\textit{Id}$ (the identity map). This is in part because we can then define the action $\\Phi_h(n)$ as $\\sigma(h) n \\sigma(h)^{-1}$ via the conjugations in $G$, giving us everything we need to construct the group operation.\nExample 2 (Revisited):\nWe have the following short exact sequence:\n\\[ 1 \\rightarrow SL(n;\\mathbb R) \\overset{\\iota}{\\rightarrow} GL(n;\\mathbb R) \\overset{\\pi}{\\rightarrow} \\mathbb R^\\times \\rightarrow 1 \\] To check that this is a valid short exact sequence, we need the image of $\\iota$ to be the kernel of $\\pi$. Let $\\iota$ be the inclusion map and $\\pi=\\textrm{det}(\\ \\cdot\\ )$, then, the kernel of $\\pi$ is the set of matrices with determinant $1$, which is precisely $SL(n; \\mathbb R)$. Thus, we already know that $\\mathbb R^\\times \\cong GL(n;\\mathbb R)/SL(n;\\mathbb R)$ by the isomorphism theorem. To check that $GL(n;\\mathbb R) \\cong SL(n;\\mathbb R) \\rtimes \\mathbb R^\\times$, we need to find a section $\\sigma: \\mathbb R^\\times \\rightarrow GL(n;\\mathbb R)$ such that $\\pi \\circ \\sigma=\\textit{Id}$. Letting\n\\[ \\sigma(\\lambda) = \\begin{bmatrix} \\lambda \u0026 0 \u0026 \\cdots \u0026 0 \\\\ 0 \u0026 1 \u0026 \\cdots \u0026 0 \\\\ \\vdots \u0026 \\vdots \u0026 \\ddots \u0026 \\vdots \\\\ 0 \u0026 0 \u0026 \\cdots \u0026 1 \\end{bmatrix}\\]\ncompletes the argument.\nNote that $\\sigma \\circ \\pi \\neq \\textit{Id}$. This is what differentiates a section/splitting from an inverse, and prevents $GL(n;\\mathbb R)$ and $\\mathbb R^\\times$ from being isomorphic (which they\u0026rsquo;re clearly not). 
Also, $\\pi$ and $\\sigma$ do not constitute a bijection, which is one of the requirements of a group isomorphism.\nGeometry Metrics Suppose that we are given (or choose) an inner product on $T_eG$, the tangent space at the identity of $G$ (which is also the vector space underlying $\\mathfrak g$). We denote this as $\\langle \\tilde X, \\tilde Y \\rangle_e$ for $\\tilde X, \\tilde Y \\in T_e G$. We can extend this metric to obtain a left-invariant metric on the entire Lie group $G$, as follows:\n\\[ \\langle X_g, Y_g \\rangle_g \\coloneqq \\langle (\\mathcal L_{g^{-1}*}) X_g,\\ (\\mathcal L_{g^{-1}*}) Y_g \\rangle_e \\] The reason this is called left-invariant is because2\n\\[ \\langle \\mathcal L_{h*} X_g, \\mathcal L_{h*} Y_g \\rangle _{hg} = \\langle X_g, Y_g \\rangle _g. \\] A bi-invariant metric is one that is simultaneously left- and right-invariant. The following Lemmas from a paper by Milnor characterize the Lie groups that admit bi-invariant metrics:\nLemma 7.1 A left-invariant metric on $\\small{G}$ is also right invariant if and only if, for each $\\small{g\\in G}$, $\\small{\\textrm{Ad}}_g$ is an isometry of the inner product on $\\small{\\mathfrak g}$ [that generates said left-invariant metric]. This is also Proposition 3.12 of Lee\u0026rsquo;s book on Riemannian Manifolds. It states that $\\textrm{Ad}_g$ are given by $O(n)$ matrices acting on $\\mathfrak g$.\nLemma 7.2 A left invariant metric on a connected Lie group is also right invariant if and only if $\\small{\\textrm{ad}_X}$ is skew-adjoint for every $\\small{X\\in\\ }\\mathfrak g$. This follows as a consequence of Lemma 7.1, since $\\mathfrak o(n)$ are precisely the skew-symmetric matrices. The next result is the most striking of the three, and immediately gives us examples of Lie groups that admit bi-invariant metrics:\nLemma 7.5 A connected Lie group admits a bi-invariant metric if and only if it is isomorphic to the direct product of a compact group and a commutative group. 
As a corollary, every compact Lie group (including closed and bounded matrix Lie groups like $SO(n)$) admits a bi-invariant metric. While $SO(n) \\times \\mathbb R^n$ admits a bi-invariant metric, $SE(n) \\cong SO(n) \\ltimes \\mathbb R^n$ does not. (Nevertheless, $SE(n)$ has the next best thing: a bi-invariant measure.)\nAn interesting property of a Lie group that admits a bi-invariant metric is that its one-parameter subgroups are also geodesic (i.e., locally shortest-distance) paths of the metric.\nMeasures While metrics help assign values to the lengths of tangent vectors (and by extension, to curves), measures assign values to subsets. In classical calculus, the expression $\\int_{\\mathbb R} f(x) dx$ refers to integration with respect to a specific, canonical choice of a measure \u0026ndash; the Lebesgue measure. The Lebesgue measure is uniquely determined (up to a scaling factor) by the fact that it should be translation invariant, i.e., $\\mu([0,1]) = \\mu ([1,2])$, and a few other properties that are viewed as being natural to the structure of $\\mathbb R^n$. It turns out that there is a canonical measure on Lie groups as well, called the (left) Haar measure, that satisfies $\\mu_G(gA) = \\mu_G(A)$ for all $g\\in G$ and all measurable $A\\subset G$, where\n$$gA \\coloneqq \\lbrace ga\\ \\vert\\ a\\in A \\rbrace.$$The Lebesgue measure is a special case of the Haar measure on $\\mathbb R^n$; the latter generalizes what we mean by translation to include any group operation. Like the Lebesgue measure, the left Haar measure can be shown to be unique up to scaling. In a compact group, a unique Haar measure can be singled out by fixing the scaling factor so that $\\int_G d\\mu_G=\\mu_G(G)=1$. If $\\mu_G$ is a left Haar measure and $g\\in G$, then $\\tilde \\mu^{(g)}_G \\coloneqq \\mu_G\\circ \\mathcal R_{g}$ is also a left Haar measure, since\n$$\\tilde \\mu^{(g)}_G(hA) = \\mu_G(hAg) = \\mu_G(Ag) = \\tilde \\mu^{(g)}_G(A)$$for all $h\\in G$. 
But since the left Haar measure is unique up to scalar multiplication, it must be the case that $\\tilde \\mu^{(g)}_G = \\Delta (g) \\mu_G$ where $\\Delta (g)$ is a scalar-valued function that only depends on $g$. This is called the modular function of $G$ \u0026ndash; once computed, it is independent of the choice of the Haar measure $\\mu_G$ that is used to compute it. It relates the left Haar measure to the right Haar measure. A lot of the properties and implications of $\\Delta(\\ \\cdot\\ )$ follow from showing that it is a group homomorphism from $G$ to $\\mathbb R^\\times_{\u003e0}$.\nA Lie group is said to be unimodular iff its left Haar measure is also a right Haar measure, i.e., it is also right-invariant. Equivalently, a group is unimodular iff its modular function is $\\Delta(g)=1$ for all $g\\in G$. Milnor once again saves us from needing to check this condition explicitly:\nLemma 6.1 A Lie group is unimodular if and only if the linear transformation $\\textrm{Ad}_g$ has determinant $\\pm 1$ for every $g\\in G$.\nThe absolute value of this determinant is precisely the modular function , $\\Delta(g)$. Recall or verify that $\\lvert\\det(\\ \\cdot\\ )\\rvert$ is a Lie group homomorphism from $GL(n; \\mathbb R)$ to $\\mathbb R^\\times_{\u003e0}$, the multiplicative group of positive real numbers.\nLemma 6.3 A connected Lie group is unimodular if and only if the linear transformation $\\textrm{ad}_X$ has trace zero for every $X\\in\\mathfrak g$.\nCompare the relationship between these lemmas with the fact that $\\textrm{det}(e^A) = e^{\\textrm{tr}(A)}$ for any matrix $A$.\nNote that for matrix Lie groups, $\\textrm{ad}_X(Y)$ is always trace zero, since $\\textrm{ad}_X(Y) = XY - YX$ and $\\textrm{tr}(XY) = \\textrm{tr}(YX)$. Lemma 6.3 is talking of the trace of $\\textrm{ad}_X$ as a linear transformation on $\\mathfrak g$ (as opposed to the trace of $\\textrm{ad}_X(Y)$). 
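To see the distinction in a concrete case, here is a small worked example (my own illustration, not one of Milnor\u0026rsquo;s lemmas). Consider the two-dimensional matrix Lie algebra spanned by $E_1$ and $E_2$ below, which I take to be the Lie algebra of the group of affine maps $x\\mapsto ax+b$ of the real line with $a\u003e0$: \\[ E_1 = \\begin{bmatrix} 1 \u0026 0 \\\\ 0 \u0026 0 \\end{bmatrix}, \\qquad E_2 = \\begin{bmatrix} 0 \u0026 1 \\\\ 0 \u0026 0 \\end{bmatrix}, \\qquad \\textrm{ad}_{E_1}(E_2) = E_1E_2 - E_2E_1 = E_2. \\] The matrix $\\textrm{ad}_{E_1}(E_2)=E_2$ has trace zero, as it must. However, since $\\textrm{ad}_{E_1}(E_1)=0$ and $\\textrm{ad}_{E_1}(E_2)=E_2$, the linear transformation $\\textrm{ad}_{E_1}$ written in the basis $(E_1, E_2)$ of $\\mathfrak g$ is \\[ \\textrm{ad}_{E_1} = \\begin{bmatrix} 0 \u0026 0 \\\\ 0 \u0026 1 \\end{bmatrix}, \\qquad \\textrm{tr}\\big(\\textrm{ad}_{E_1}\\big) = 1 \\neq 0, \\] so by Lemma 6.3 this (connected) group is not unimodular. 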
Moreover, for computing the trace of $\\textrm{ad}_X$ it does not matter what basis of $\\mathfrak g$ we choose, since the trace is a property of the underlying linear transformation rather than of the matrix. As a special case of Lemma 6.3, the $\\textrm{ad}_X$ matrices corresponding to a nilpotent Lie algebra are given by nilpotent matrices , which have trace zero. Thus, every connected nilpotent Lie group is unimodular. Moreover, compact Lie groups are unimodular because $\\Delta$ is a continuous homomorphism into the non-compact group $\\mathbb R^\\times_{\u003e0}$, so its image is a compact subgroup of $\\mathbb R^\\times_{\u003e0}$, and the only compact subgroup of $\\mathbb R^\\times_{\u003e0}$ is $\\lbrace 1 \\rbrace$.\nBi-invariant Haar measures are incredibly useful because they offer a way to integrate on Lie groups (as well as to define probability densities, Fourier transforms, etc.) while retaining the intuitive and attractive properties of the Lebesgue integral:\n\\[ \\begin{align} \\int_G f(g) d\\mu_G(g) \u0026= \\int_G f(g^{-1}) d\\mu_G(g) \\nonumber\\\\ \u0026= \\int_G f(hg) d\\mu_G(g) \\nonumber\\\\ \u0026= \\int_G f(gh) d\\mu_G(g) \\nonumber \\end{align} \\] where $f:G\\rightarrow \\mathbb R$ is an integrable function, and $g \\mapsto hg$ is to be viewed as the analogue of translation in $\\mathbb R^n$. By requiring $f$ to be nonnegative and integrable (for instance, continuous and compactly supported) and normalizing it, one is able to define a probability density function on $G$.\nIn the next post , I use the machinery developed in the book Introduction to Smooth Manifolds by John M. Lee to study the properties of Riemannian metrics and measures. If you would rather skip ahead to the actual calculations, I point you toward the books by G.S. Chirikjian . They contain formulae for differentiation and integration on the (matrix) Lie groups that are commonly encountered in engineering applications.\nNote that $f(x) = x^{1/n}$ is a multi-valued function. 
Since it is an $n^{th}$ degree polynomial, it will have exactly $n$ (not necessarily distinct) complex roots.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThis property makes the right-invariant vector fields on $G$ examples of Killing vector fields with respect to the left-invariant metric \u0026ndash; this is because left-invariant flows are given by right-multiplication , and vice versa. Also see this .\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/lie-groups_construction/","summary":"There are multiple ways to construct new groups from old ones. I provide an intuition for how these constructions work, and also go over some of the additional structures that can be imposed on Lie groups, paving the path towards doing differential geometry and calculus on Lie groups.","title":"Lie Groups: Construction and Geometry"},{"content":"A topological group is a set of elements $G$ that has both a group operation $\\odot$ and a topology . The group operation satisfies the usual axioms (same as those of finite groups ), and the presence of a topology lets us say things like \u0026rsquo;the group is connected\u0026rsquo; and \u0026rsquo;the group operation is continuous\u0026rsquo;. $G$ is called a Lie group if it is also a smooth manifold. The smooth structure of the manifold must be compatible with the group operation in the following sense: $\\odot$ is differentiable with respect to either of its arguments 1. The compatibility of its constituent structures is what makes a Lie group so special, enabling it to capture the essence of a continuous symmetry .\nA different (but closely related) mathematical object is the Lie algebra. A Lie algebra $\\mathfrak v$ is a vector space equipped with an operation called the Lie bracket, $[\\cdot, \\cdot]: \\mathfrak v \\times \\mathfrak v \\to \\mathfrak v$, that satisfies certain properties that parallel those of a \u0026lsquo;cross product\u0026rsquo;. 
While a Lie algebra may exist in the absence of an associated Lie group, every Lie group gives rise to a Lie algebra2. In other words, we can associate to each Lie group $G$ a corresponding Lie algebra, with the latter typically denoted as $\\mathfrak g$ to emphasize its relationship to $G$. Letting $e$ denote the identity element of $G$, we will see that $T_e G$ (the tangent space of $G$ at $e$), together with an appropriately defined bracket operation, is a natural candidate for $\\mathfrak g$. Consider as an example $SO(2)$, the group of rotation matrices of $\\mathbb R^2$ having determinant $1$, whose identity element $e$ is the identity matrix $I$. The tangent space $T_ISO(2)$ consists of the $2\\times 2$ skew-symmetric matrices. Skew-symmetric matrices represent infinitesimally small rotations, since near the identity element of $SO(2)$ (i.e., the identity matrix), we have\n\\[ \\begin{align} \\theta \\approx 0 \\Rightarrow \\begin{bmatrix} \\ \\ \\cos\\theta \u0026 -\\sin\\theta\\ \\ \\\\ \\ \\ \\sin\\theta \u0026 \\cos\\theta \\end{bmatrix} \\approx I + \\begin{bmatrix} \\ 0 \u0026 -\\theta \\ \\\\ \\ \\theta \u0026 0 \\end{bmatrix} \\end{align} \\] Still, the above observation alone does not make it clear what the relationship between $SO(2)$ and $T_ISO(2)$ is. For starters, why should one expect infinitesimal rotations to be related in any way to arbitrary (large angle) rotations? What is the significance of the Lie bracket?\nThis post is by no means meant to be an introduction to Lie groups; for that, I recommend the first few chapters of Brian C. Hall\u0026rsquo;s book. I will instead hurry us along to our main line of investigation \u0026ndash; understanding the Lie group-Lie algebra correspondence, pausing only to show you some pictures/diagrams that I had fun drawing. 
A bonus takeaway from this post will be a deeper understanding of the exponential map, one that unifies the exponentials of real numbers, complex numbers, and matrices.\nPrerequisites: Conceptual understanding of the tangent spaces of a smooth manifold and familiarity with the standard examples of matrix Lie groups, including $SO(n)$, $GL(n,\\mathbb R)$ and $\\mathbb R^n$. Background The details in this section may be skipped , but I suggest looking at the illustration below before moving on.\nPushforwards Given a smooth, parameterized curve $\\gamma:\\mathbb R \\rightarrow G$, let $(U,h)$ be a chart of $G$ such that $\\gamma (0)\\in U$. Observe that $h\\circ \\gamma:\\mathbb R\\rightarrow \\mathbb R^n$ can be differentiated in the usual way, and that $\\frac{d}{dt}\\left[h\\circ \\gamma (t)\\right]\\big\\vert_{t=0}$ is simply a vector in $\\mathbb R^n$. All of the curves on $G$ that result in a given vector of $\\mathbb R^n$ when differentiated as above represent the same tangent vector, i.e., a single element of the tangent space $T_{\\gamma(0)}G$.\nLet $[$$\\gamma$$]$ denote the tangent vector (or more precisely, the equivalence class) corresponding to the curve $\\gamma$. We say that $[$$ h\\circ \\gamma$$]$ is the pushforward of the tangent vector $[$$\\gamma$$]$ under the map $h$. More generally, if $f:\\mathcal M \\rightarrow \\mathcal N$ is a smooth map between manifolds, then the differential of $f$ at $p\\in \\mathcal M$ is the linear operator that maps tangent vectors at $p$ to their pushforwards at $f (p)$:3\n\\[ \\begin{align} (f_*)_p:\\ T_p \\mathcal M \u0026\\rightarrow T_{f (p)} \\mathcal N\\\\ \\mathbf v \u0026\\mapsto (f_*)_p \\mathbf v \\nonumber \\end{align} \\] In practice, $(f_{\\ast})_p$ ends up looking something like the Jacobian of $f$ evaluated at $p$. 
The caveat is that a Jacobian (matrix) maps vectors in $\\mathbb R^n$ to vectors in $\\mathbb R^m$, whereas $(f_{\\ast})_p$ does the more general job of mapping vectors in $T_p \\mathcal M$ to vectors in $T_{f (p)} \\mathcal N$.\nGiven $g\\in G$, let $\\mathcal L_g:G\\rightarrow G$ denote left-multiplication by $g$, i.e., $\\mathcal L_g(h) = g\\odot h$ for all $h\\in G$. Here\u0026rsquo;s how a tangent vector at the identity $e\\in G$ can be \u0026lsquo;pushed forward\u0026rsquo; by the left-multiplication map $\\mathcal L_g$:\nwhere the curve passing through $g$ was obtained by composing $\\gamma$ with $\\mathcal L_g$. Since $T_eG$ is going to be identified4 with $\\mathfrak g$ (as a vector space), the above illustration is going to play a key role in the forthcoming discussion. It shows that $(\\mathcal L_{g^{-1}*})_g$$=(\\mathcal L_{g{\\ast}})_e ^{-1}$ will reduce a tangent vector at $g$ to an element of the Lie algebra.\nBy reuse of notation, we can also \u0026lsquo;push forward\u0026rsquo; entire vector fields (when $f$ is a diffeomorphism):\n\\[ \\begin{align} f_*:\\ \\mathfrak X (\\mathcal M) \u0026\\rightarrow \\mathfrak X (\\mathcal N)\\\\ X \u0026\\mapsto f_* X \\nonumber \\end{align} \\] where $\\mathfrak X(\\cdot)$ denotes the set of all smooth vector fields on a manifold.\nMorphisms Most (if not all) mathematical objects come with a distinctive structure; for topological spaces, it is their topology/open sets, for vector spaces their vector addition and scalar multiplication operations, for finite groups the existence of inverses, and so on. Mappings between objects of the same type that preserve these structures are called homomorphisms (or in the jargon of category theory, simply morphisms). The homomorphisms between vector spaces are the linear transformations between them. 
Suppose $f:V\\rightarrow W$ is a linear transformation, then\n\\[ f(v_1 \\overset{V}{+}v_2)=f(v_1)\\overset{W}{+}f(v_2) \\in W, \\] which shows that the structure of the vector addition operation $\\overset{V}{+}$ of $V$, has been transported to that of the $\\overset{W}{+}$ operation of $W$. This suggests that homomorphisms (i.e., structure-preserving maps) may be paramount to the study of the underlying mathematical structure, which is indeed the case (see linear algebra).\nIf $A$ and $B$ are two objects of the same type and $f$ a homomorphism between them, we simply write\n\\[ \\begin{array}{c} A \\overset{f}{\\longrightarrow} B \\end{array} \\] where the meaning of $f$ depends on which type of mathematical structure is being transported. A homomorphism for which there also exists an \u0026lsquo;inverse homomorphism\u0026rsquo; $g:B \\rightarrow A$, such that $f \\circ g = g\\circ f =\\ $the identity map, is called an isomorphism. Isomorphisms between vector spaces are those linear transformations that can be represented as invertible matrices. Beware that neither word, homomorphism or isomorphism, should be uttered unless the structure in question is contextually obvious. Two objects are never simply isomorphic, they are isomorphic as vector spaces, or isomorphic as topological spaces, and so on.\nA homomorphism between topological spaces is a continuous map between them. The word homeomorphism, rather confusingly, refers to an isomorphism (and not a homomorphism) between topological spaces; this piece of nomenclature is quite a tragedy. An isomorphism between smooth manifolds is a differentiable map with a differentiable inverse \u0026ndash; a diffeomorphism.\nA Lie group homomorphism $\\varphi$ is a map between two Lie groups that preserves both the group operation and the topology; i.e., it is simultaneously a group homomorphism and a continuous map. 
The best way to understand what a group homomorphism entails is through a commutative diagram:\nWe say that this diagram commutes if the two compositions of arrows (top-right and left-bottom, each of which results in $\\searrow$) are in fact the same arrow:\n\\[ \\begin{align} \\mathcal L_{\\varphi (g)}\\circ \\varphi = \\varphi \\circ \\mathcal L_g \\end{align} \\] where $\\circ$ indicates the composition of functions. Feeding an argument $\\tilde g\\in G$ on either side, we get \\[ \\begin{align} \\varphi (g) \\overset{H}{\\odot}\\varphi(\\tilde g) \u0026= \\varphi (g \\overset{G}{\\odot} \\tilde g) \\\\ \\end{align} \\] where $\\overset{G}{\\odot}$ is the group operation in $G$ and $\\overset{H}{\\odot}$ is the group operation in $H$. This makes it (at least notationally) clear that $\\varphi$ preserves the group structure, though one should work out the consequences of this definition; for instance, it can be shown that $\\varphi$ should map the identity of $G$ to the identity of $H$. An example of a Lie group homomorphism is the determinant of a matrix, $\\textrm {det}:GL(n;\\mathbb R) \\rightarrow \\mathbb R^\\times$, since\n\\[ \\begin{align} \\textrm {det}(AB) = \\textrm {det}(A) \\textrm{det}(B) \\end{align} \\] Here, $GL(n;\\mathbb R)$ is the general linear group consisting of $n\\times n$ invertible matrices and $\\mathbb R^\\times$ is the multiplicative group of real numbers (crucially, $0\\notin \\mathbb R^\\times$). Observe that $\\textrm {det}(I)=1$ as promised.\nInvariant Vector Fields As I hinted at previously, the key to uncovering the Lie group-Lie algebra correspondence is to study the topological and group structures of $G$ simultaneously. How does one do this? 
A good starting point would be to specialize the curves and vector fields considered above (which are topological objects) to those special ones that also respect the group structure of $G$.\nA similar discussion holds for right-multiplication, but going down that path will yield the same Lie algebra, albeit with some sign changes. This brings us to a central object in the study of Lie groups, the space of left/right-invariant vector fields. A left-invariant vector field is one for which the vectors at two different points are related by the pushforward operator $\\mathcal L_{g*}$ corresponding to left-multiplication by the group elements. More rigorously, a vector field $X\\in \\mathfrak X (G)$ is said to be left-invariant if the following diagram commutes: For example, we have for $h\\in G$,\n\\[ \\begin{align} X \\circ \\mathcal L_g (h) = (\\mathcal L_{g*})_h X(h) \\end{align} \\] $=X(g\\odot h)$. An important consequence of this definition is that just by knowing $X(e) \\in T_e G$, we can determine the value of $X$ at all the other points, since $X(g) = (\\mathcal L_{g*})_e X(e)$. In fact, one can construct a left-invariant vector field by picking any vector $\\tilde X \\in T_eG$ and defining $X(g) \\coloneqq (\\mathcal L_{g*})_e \\tilde X$. Conversely, given a left-invariant vector field $X$, we can simply evaluate it at the identity $e$ to determine the $\\tilde X$ that generated it. Thus, the space of left-invariant vector fields on $G$, written as $\\mathfrak X^{\\mathcal L}(G)$, is isomorphic to $T_eG$ as a vector space (the fact that $\\mathfrak X^{\\mathcal L}(G)$ can be given a vector space structure is for the reader to deduce). The word \u0026rsquo;left-invariant\u0026rsquo; comes from the fact that $\\mathcal L_{g*}X= X$.\nA left-invariant vector field $X\\in \\mathfrak X^{\\mathcal L}(G)$ is special because it represents \u0026lsquo;water\u0026rsquo; flowing along the surface of $G$ in perfect concordance with the group structure of $G$. 
The fact that such an object can be related to $T_eG$ bodes well for the establishment of $T_eG$ as an object that corresponds to the Lie group $G$. However, some reflection will show that we need to do more work to recover the group structure of $G$ at $T_eG$. For starters, the group multiplication operation $\\odot$ need not be commutative, whereas the vector addition operation $+$ in $T_eG$ is commutative by definition. This is where the Lie bracket comes in; it is a multiplication-like operation that can be imposed on $T_e G$ to in some sense \u0026lsquo;measure the failure of commutativity\u0026rsquo; in $G$. We will revisit this point a little later.\nThe Exponential Map If the Lie algebra is to correspond to the Lie group, the elements of the Lie algebra should be somehow associated with the elements of the Lie group. How do we associate $\\tilde X \\in T_eG$ to a unique group element of $G$? First, we extend $\\tilde X$ to the unique left-invariant vector field $X$ that satisfies $X(g) = (\\mathcal L_{g*})_e \\tilde X$. Thereafter \u0026ndash; and this is going to sound silly \u0026ndash; we place a \u0026lsquo;boat\u0026rsquo; at the identity $e$ and let it flow along the surface of $G$ in the direction of $X$ for exactly one unit of time!\nLet\u0026rsquo;s unpack what that means. The boat is going to trace out a path/curve on $G$, which we denote by $\\gamma :[0, 1] \\rightarrow G$, such that $\\gamma (0)=e$. At time $t\\in[0,1]$, the boat\u0026rsquo;s position is given by $\\gamma(t)\\in G$. Its velocity at time $t$ is given by $\\gamma ^\\prime(t)=X(\\gamma(t))$. Thus, we require that\n\\[ \\begin{align} \\gamma ^\\prime(t) = \\big(\\mathcal L_{\\gamma(t)*}\\big)_e \\tilde X \\end{align} \\] The equation above is a differential equation (or dynamical system) that can be solved to yield a solution (or trajectory) $\\gamma (t)$. The solution is called an integral curve or a flow of $X$ starting at $e$. 
Of course, we can solve it by using the local charts of $G$ to (locally) reduce it to a system of ordinary differential equations in $\\mathbb R^n$, and then \u0026lsquo;stitching\u0026rsquo; the local solutions together to get the overall curve on $G$. Before I convince you that this can indeed be done, let\u0026rsquo;s exercise prescience in making the following definition :\n\\[ \\begin{align} \\exp:\\ T_e G \u0026\\rightarrow G\\\\ \\tilde X \u0026\\mapsto \\gamma(1) \\nonumber \\end{align} \\] where $\\gamma$ is the integral curve (or flow) that solves $(8)$ for the given choice of $\\tilde X$. Note that $\\gamma ^\\prime(0) = \\tilde X$ is the initial velocity of the boat.\nExample 1: $G=\\mathbb R^\\times$, the Multiplicative Group of Real Numbers In this case, $\\mathcal L_g$ and $(\\mathcal L_{g*})_e$ reduce to the same operation \u0026ndash; multiplication of real numbers5. Equation $(8)$ reduces to\n\\[ \\begin{align} \\gamma ^\\prime(t) = \\gamma(t) \\tilde X \\end{align} \\] where $\\tilde X \\in T_e\\mathbb R^\\times \\cong \\mathbb R$ and $e=1$ is the identity element (of multiplication). By seeking a power series solution (or better yet, through an informed guess), we get\n\\[ \\begin{align} \\gamma(t) = 1 + t\\tilde X + \\frac{t^2}{2!}\\tilde X^2 + \\frac{t^3}{3!}\\tilde X^3 + \\cdots \\end{align} \\] so that $\\exp(\\tilde X)=\\gamma(1)$ is the usual exponential function that we\u0026rsquo;ve come to know and love. By the uniqueness of the solution to an ODE, the exponential map is well-defined in this case.\nRemark: The group $\\mathbb R^\\times$ has a 'hole' at zero, so that our boat cannot flow to the negative real numbers. Consequently, $\\exp$ only maps to the positive reals (i.e., it isn't surjective). Nevertheless, the exponential map is always a local diffeomorphism (and thus, locally invertible) near the identity element $0\\in\\mathfrak g$; this can be shown to hold for all Lie groups using the inverse function theorem. 
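As a check on Example 1, differentiating the power series term by term (which I assume is permissible here, since the series converges for every $t$) recovers both the differential equation and the initial condition: \\[ \\gamma ^\\prime(t) = \\tilde X + t\\tilde X^2 + \\frac{t^2}{2!}\\tilde X^3 + \\cdots = \\left(1 + t\\tilde X + \\frac{t^2}{2!}\\tilde X^2 + \\cdots\\right)\\tilde X = \\gamma(t)\\tilde X, \\qquad \\gamma(0) = 1. \\] 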
Example 2: $G=U(1) \\cong SO(2)$, the Circle Group $U(1)$ is the group of complex numbers of unit modulus, with the group operation $\\odot$ being the multiplication of complex numbers. Since we have already seen how $(8)$ can be solved, a visual depiction of the exponential map might be more gratifying:\nBecause we are able to visualize this Lie group as a (topological) subspace of $\\mathbb R^2$, we can quite literally see the boat flowing along the surface of $G$ in the direction of $X$. Here, $X(e^{i\\theta})=e^{i\\theta}\\tilde X$, so the left-invariant vector fields are generated by sliding $\\tilde X$ along the circle without changing its length.\nExample 3: $G=GL(n;\\mathbb R)$, the Invertible Matrices Equation $(8)$ becomes\n\\[ \\begin{align} \\gamma ^\\prime(t) = \\gamma(t) \\tilde X \\end{align} \\] (Just like in $\\mathbb R^\\times$, matrix multiplication and its differential both reduce to the same operation 5.) The rest follows in the same way as in Example 1.\nGoing back to Example 1, notice that a large negative initial velocity at $1\\in \\mathbb R^\\times$ sends the boat to a small positive number, but never to $0$. For an analogous reason, $\\exp(\\tilde X)$ is always an invertible matrix. As the determinant of $\\gamma(t)$ should change smoothly during the boat\u0026rsquo;s trajectory (and apparently it never hits the value $0$), we conclude that $\\det(\\exp(\\tilde X))\u003e0$. Thus, $\\exp$ is once again not surjective.\nExample 4: $G\\cong \\mathbb R$, the Shift Operators The case for $G=\\mathbb R$ with addition as the group operation seems rather uninteresting at first. Equation $(8)$ becomes\n\\[ \\begin{align} \\gamma ^\\prime(t) = \\tilde X \\end{align} \\] since the differential of the addition operation leaves the vector $\\tilde X\\in T_0 \\mathbb R$ unchanged (after identifying all of the tangent spaces of $\\mathbb R$ with $\\mathbb R$). Thus, $\\gamma(t)=t\\tilde X + C$. 
Since $\\gamma(0)=0$, we have $C=0$ and $\\gamma(t)=t\\tilde X$. This makes $\\exp(\\tilde X) = \\gamma(1)=\\tilde X$; the boat has moved away from the origin for $1$ unit of time under constant velocity. The same is true for $\\mathbb R^n$, and in fact for any vector space with vector addition as the group operation.\nThe above result becomes interesting when we consider a group isomorphism from $\\mathbb R$ to the space of shift operators . Consider an entirely new group $G$, and let $S^{a} \\in G$ be something that operates on functions of the form $f:\\mathbb R \\rightarrow \\mathbb R$ by shifting them to the left (if ${a} \u003e0$) or right (if ${a}\u003c0$) by $\\lvert a\\rvert$ units:\n\\[ \\begin{align} (S^{a} f)(t) = f(t + {a}) \\end{align} \\] where ${a} \\in \\mathbb R$. Clearly, $e=S^0$ is the identity element of $G$ and $S^a\\circ S^{-a} = S^0$. I defer the details to a footnote6, but a tangent vector in $T_{S^0}G$ is given by a differential operator of the form $\\tau \\frac{d}{dt}$, where $\\tau \\in \\mathbb R$. The exponential map is then given by\n\\[ \\begin{align} \\exp\\left(\\tau \\frac{d}{dt}\\right) = S^\\tau \\end{align} \\] It cannot be overstated just how remarkable the above result is. Letting the left-hand side of $(15)$ operate on a function $f$ and evaluating the resulting function at $t_0$, we get\n\\[ \\begin{align} \\left(\\exp\\left(\\tau \\frac{d}{dt}\\right) f\\right)(t_0) \u0026= \\left[\\left( 1 + \\tau \\frac{d}{dt} + \\frac{\\tau^2}{2!} \\frac{d^2}{dt^2} + \\cdots \\right) f\\right](t_0) \\end{align} \\] whereas on the right-hand side, we have $(S^\\tau f )(t_0) =f(t_0+\\tau)$. Thus,\n\\[ \\begin{align} f(t_0) + \\tau \\frac{df}{dt}(t_0) + \\frac{\\tau^2}{2!} \\frac{d^2f}{dt^2}(t_0) + \u0026\\dots =f(t_0+\\tau) \\end{align} \\] which, for a real-analytic $f$, is nothing but the Taylor series expansion of $f$ at $t_0$! 
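For a concrete instance of this identity, take $f(t)=t^2$ (my choice of example); its derivatives vanish beyond the second order, so the series terminates and can be summed by hand: \\[ f(t_0) + \\tau \\frac{df}{dt}(t_0) + \\frac{\\tau^2}{2!} \\frac{d^2f}{dt^2}(t_0) = t_0^2 + 2\\tau t_0 + \\tau^2 = (t_0+\\tau)^2 = f(t_0+\\tau). \\] 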
In a sense, the Taylor series expansion starts at $t_0$ and then \u0026lsquo;slides along the graph of $f$\u0026rsquo; to obtain its value at the other points.\nControl theorists should connect the above observations to the discretization of a continuous-time state-space model in the linear time-invariant case. Properties of $\\exp$ We have skipped a lot of the standard results in Lie theory in order to get to the fun parts of this blog post, but the following properties of $\\exp$ are worth mentioning:\nIt is always locally invertible near the identity element $0\\in \\mathfrak g$. Given a choice of $\\tilde X\\in \\mathfrak g$, $\\gamma(t)=\\exp (t \\tilde X)$ is a one-parameter subgroup of $G$, i.e., a Lie group homomorphism from $\\mathbb R$ to $G$. Consequently, $\\exp\\big((t_1+t_2) \\tilde X\\big)=\\exp(t_1\\tilde X) \\exp (t_2 \\tilde X)$, and $\\exp(-\\tilde X)=\\exp (\\tilde X)^{-1}$. $\\exp(t\\tilde X)$ represent geodesic paths passing through the identity of $G$ with respect to a particular choice of metric (namely, a bi-invariant Riemannian metric, if one exists ) and the resulting Levi-Civita connection. The Flows of $\\mathfrak X^{\\mathcal L}(G)$ Before we proceed, we need to see how the one-parameter subgroups $\\exp(t\\tilde X)$ can be extended to flows. A flow of $X\\in \\mathfrak X(G)$ is a map $\\Phi:\\mathbb R \\times G \\rightarrow G$ such that\n\\[ \\begin{align} \\Phi(0,g) = g \\quad \\text{and} \\quad \\frac{d}{ds}\\Big\\vert_{s=t} \\Phi(s,g) = X\\big(\\Phi(t,g)\\big) \\end{align} \\] Notice that $\\Phi(\\cdot , e ) = \\gamma(\\cdot)$ as before, but now we\u0026rsquo;re permitted to place the boat at any point $g\\in G$ at $t=0$ and see how it flows. In this more general setting, the flow map $\\Phi$ is given by\n\\[ \\begin{align} \\Phi(t,g) = \\mathcal R_{\\gamma(t)}(g) \\end{align} \\] where $\\gamma(t) = \\exp(t\\tilde X)$. We say that flows of left-invariant vector fields are given by right-multiplications, and vice versa. 
As a function of $t$, $(19)$ is a trajectory that flows along the left-invariant vector field $X$ generated by $\\tilde X$, passing through $g$ at $t=0$. Why is that? The velocity vectors along this trajectory are given by\n\\[ \\begin{align} \\frac{d}{ds}\\Big\\vert_{s=t} \\Phi(s,g) \u0026= \\frac{d}{ds}\\Big\\vert_{s=t} \\mathcal R_{\\gamma(s)}(g)\\\\ \u0026= \\frac{d}{ds}\\Big\\vert_{s=t} \\big(g \\odot \\gamma(s)\\big)\\\\ \u0026= \\frac{d}{ds}\\Big\\vert_{s=t} \\mathcal L_g \\gamma(s)\\\\ \u0026= \\big(\\mathcal L_{g*}\\big)_{\\gamma(t)} \\frac{d}{ds}\\Big\\vert_{s=t} \\gamma(s)\\\\ \\end{align} \\] Because $\\gamma(t)$ solves $(8)$, we have\n\\[ \\begin{align} \\frac{d}{ds}\\Big\\vert_{s=t} \\Phi(s,g) \u0026= \\big(\\mathcal L_{g*}\\big)_{\\gamma(t)} X\\big(\\gamma(t)\\big)\\quad \\\\ \u0026= X\\big(g \\odot \\gamma(t)\\big)\\\\ \u0026= X\\big(\\mathcal R_{\\gamma(t)}(g)\\big)\\\\ \u0026= X\\big(\\Phi(t,g)\\big). \\end{align} \\] Lie Bracket The fact that $\\exp (t \\tilde X)$ is a one-parameter subgroup of $G$ means that the corresponding subgroup must be Abelian, i.e., $\\exp (t_1 \\tilde X)$ and $\\exp (t_2 \\tilde X)$ commute under $\\odot$ even if $\\odot$ was not commutative in $G$. For instance, the (non-trivial) one-parameter subgroups of $SO(3)$ are rotations about a fixed axis \u0026ndash; each of these is isomorphic to $SO(2)$, which is Abelian:\n$$R_1,R_2\\in SO(2) \\Rightarrow R_1 \\odot R_2 = R_2 \\odot R_1$$This means that we still haven\u0026rsquo;t captured the (potential) non-commutativity of the group at the Lie algebra. To do this, we first need to understand vector fields as derivations . An operator $X:C^\\infty(G) \\rightarrow C^\\infty(G)$ is called a derivation if it is linear and satisfies the Leibniz rule:\n\\[ X(f_1\\cdot f_2) = f_1\\cdot X(f_2) + f_2\\cdot X(f_1) \\] with $f_1\\cdot f_2$ indicating pointwise multiplication of $C^\\infty(G)$ functions. 
Vector fields are derivations by construction, though we haven\u0026rsquo;t had to emphasize this aspect of them until now. When $X$ acts on a $C^\\infty(G)$ function $f$, we will treat it as a derivation, but when $g \\in G$, we will treat $X(g)$ as a tangent vector (which acts on one-forms instead).\nFor $X,Y\\in \\mathfrak X(G)$ and $f_1,f_2\\in C^\\infty(G)$, since $Y(f_1\\cdot f_2)$ is again a $C^\\infty(G)$ function, we can have $X$ act on it as follows:\n\\[ \\begin{align} X\\big(Y(f_1\\cdot f_2)\\big) \u0026= X\\big(f_1\\cdot Y(f_2) + f_2\\cdot Y(f_1)\\big)\\\\ \u0026= f_1 \\cdot X\\big(Y(f_2)\\big) + X(f_1)\\cdot Y(f_2) \\nonumber \\\\\u0026\\quad + f_2\\cdot X\\big(Y(f_1)\\big) + X(f_2)\\cdot Y(f_1)\\\\ \\end{align} \\] which indicates that $X\\big(Y(\\ \\cdot\\ )\\big)$ is not a derivation. If we instead define\n$$[X,Y]=X(Y(\\ \\cdot\\ )) - Y(X(\\ \\cdot\\ )),$$then one has (using $(29)$ and its $X$-$Y$ interchanged version)\n\\[ \\begin{align} [X,Y](f_1\\cdot f_2) \u0026= f_1 \\cdot X\\big(Y(f_2)\\big) + f_2\\cdot X\\big(Y(f_1)\\big)\\\\ \u0026\\quad - f_1 \\cdot Y\\big(X(f_2)\\big) - f_2\\cdot Y\\big(X(f_1)\\big) \\\\ \u0026=f_1\\cdot [X,Y](f_2) + f_2\\cdot [X,Y](f_1) \\end{align} \\] making $[X,Y]$ a derivation. A subtle, yet important point to note is that $X\\big(Y(\\ \\cdot\\ )\\big)$ is not invariant under a change of coordinates, whereas $[X,Y]$ is. This is often the motivation for the definition of $[X,Y]$ that is given in tensor calculus textbooks. One can show that $[X,Y]\\in \\mathfrak X(G)$. Following a similar line of reasoning, the Lie bracket of left-invariant vector fields is a left-invariant vector field. 
This takes a particularly useful form in matrix Lie groups: due to each left-invariant vector field being uniquely determined by its value at the identity, we can identify the Lie bracket of left-invariant vector fields $X,Y\\in\\mathfrak X^{\\mathcal L}(G)$ with the commutator of matrices:\n\\[ [X,Y](e) = \\tilde X \\star \\tilde Y - \\tilde Y\\star \\tilde X = [\\tilde X,\\tilde Y]_{\\star} \\] where I use $\\star$ to make it explicit that we\u0026rsquo;re relying on matrix multiplication; the $T_e G$ of a non-matrix Lie group does not necessarily come with a $\\star$-like multiplication operation. When writing $[X,Y] (e)$, we are once again interpreting $[X,Y]$ as a vector field $G \\rightarrow TG$ rather than a derivation.\nLastly, we should observe the connection of the Lie bracket to the Lie derivative between vector fields; namely, that they are one and the same. Letting $\\Phi$ be the flow map corresponding to $X$, we have\n\\[ \\begin{align} [X,Y](g) \u0026= \\mathcal L_X Y(g) \\\\ \u0026= \\lim_{\\epsilon \\rightarrow 0} \\frac{\\Big(\\Phi(-\\epsilon, \\cdot\\ )_* Y\\Big)\\big(\\Phi(\\epsilon,g)\\big) - Y(g)}{\\epsilon}\\\\ \u0026= \\frac{d}{dt}\\Big\\vert_{t=0} \\Big(\\Phi(-t, \\cdot\\ )_* Y\\Big)\\big(\\Phi(t, g )\\big) \\end{align} \\] $\\mathcal L_X Y$ is defined such that the vectors being subtracted in the numerator are in the same tangent space. 
In the very special case where $X,Y \\in \\mathfrak X^{\\mathcal L}(G)$, we have\n\\[ \\begin{align} [X,Y](g) \u0026= \\frac{d}{dt}\\Big\\vert_{t=0} \\Big(\\mathcal R_{\\gamma(-t)*} Y\\Big)\\big(\\mathcal R_{\\gamma(t)}(g)\\big)\\\\ \u0026= \\frac{d}{dt}\\Big\\vert_{t=0} \\Big(\\mathcal R_{\\gamma(-t)*} Y\\Big)\\big((g\\odot \\gamma(t))\\big)\\\\ \u0026= \\frac{d}{dt}\\Big\\vert_{t=0} \\Big(\\mathcal R_{\\gamma(-t)*} \\mathcal L_{g*} Y\\Big)\\big(\\gamma(t)\\big)\\\\ \u0026= \\frac{d}{dt}\\Big\\vert_{t=0} \\Big(\\mathcal R_{\\gamma(-t)*} \\mathcal L_{g*} \\mathcal L_{\\gamma(t)*} Y\\Big)\\big(e\\big)\\\\ \u0026= \\frac{d}{dt}\\Big\\vert_{t=0} \\Big(\\mathcal L_{g*}\\textrm{Ad}_{\\gamma(t)} \\tilde Y\\Big) \\end{align} \\] where $\\gamma(t) = \\Phi(t,e) = \\exp(t\\tilde X)$, and we used the fact that $\\mathcal L_{(\\ \\cdot\\ )}$ always commutes with $\\mathcal R_{(\\ \\cdot\\ )}$ (as they act from opposite directions). In particular,\n\\[ \\begin{align} [X,Y](e) = \\frac{d}{dt}\\Big\\vert_{t=0} \\Big(\\textrm{Ad}_{\\gamma(t)} \\tilde Y\\Big) = \\textrm{ad}_{\\tilde X} \\tilde Y \\end{align} \\] where $\\textrm{Ad}_{(\\cdot)}$ and $\\textrm{ad}_{(\\cdot)}$ are the adjoint representations of $G$ and $\\mathfrak g$ (which I will assume you\u0026rsquo;ve seen before).\nRepresentation Theory A representation of $G$ is a Lie group homomorphism from $G$ to $GL(\\mathfrak g)$, where the latter is the group of automorphisms of $\\mathfrak g$ (as a vector space). A Lie group homomorphism need not be particularly instructive, however, since the map $g \\mapsto e$ is also a homomorphism from $G$ to $\\left\\lbrace e\\right\\rbrace$. This is like multiplication by $0$ in a vector space \u0026ndash; it is a linear map, but a rather useless one. 
A representation is most useful when it is also faithful , meaning that the corresponding Lie group homomorphism is injective/one-to-one.\nGiven a Lie group homomorphism $\\varphi:G\\rightarrow H$, it induces a corresponding Lie algebra homomorphism $\\varphi_*:\\mathfrak g \\rightarrow \\mathfrak h$ that makes the following diagram commute:\nAs implied through $(41)$ and the choice of notation here, $\\varphi_*$ is simply the differential of $\\varphi$ at the identity element of $G$. In particular, the adjoint representations $\\textrm{Ad}$ and $\\textrm{ad}$ are related to each other in this way (see Theorem 8.44 of Lee\u0026rsquo;s book, 2nd edition). Specifically, $\\textrm{ad}:\\mathfrak g \\rightarrow \\mathfrak {gl}(\\mathfrak g)$, where $\\mathfrak {gl}(\\mathfrak g)$ are the endomorphisms of $\\mathfrak g$ (as a Lie algebra).\nEither representation ($\\textrm{Ad}/\\textrm{ad}$) is uninteresting when $\\odot$ is commutative, in which case conjugation reduces to the identity map ($g\\odot h \\odot g^{-1} = h$) and the Lie bracket of $\\mathfrak g$ vanishes identically. However, they are indispensable tools for studying non-commutative groups. In what follows, we will demonstrate yet another line of investigation in which the adjoint representations arise as a measure of non-commutativity.\nMy previous post introduced the notion of a fiber bundle, of which the tangent bundle of $G$, $TG$, is an example. The following diagram shows that two bundles $(E_1, M, \\pi_1, F)$ and $(E_2, M, \\pi_2, F)$ over the same base space ($M$) and fiber ($F$) may be fundamentally different:\nHere, the existence of the homeomorphism $\\varphi: E_1 \\rightarrow M \\times F$ shows that $E_1$ and $M\\times F$ are similar in some sense, and the non-existence of one at $E_2$ shows that it is different from the others. For instance, $E_2$ has only a single, connected edge, whereas the cylindrical shapes have two (top and bottom) edges. 
Mentally, we can think of a homeomorphism between two spaces as the ability to morph one space (as if it were made of extremely malleable clay) into the other without cutting, gluing, or poking holes into it.\nThe bundle corresponding to $M\\times F$ is called the trivial bundle over $M$ having the fiber $F$, and the homeomorphism $\\varphi$ is called a trivialization of the bundle $(E_1, M, \\pi_1, F)$. By virtue of the existence of $\\varphi$, we will write $E_1 \\cong M\\times F$ (as fiber bundles).\nThe purpose of introducing the notion of a trivial bundle is that we will demonstrate the following fact: $TG \\cong G \\times \\mathfrak g$. We can trivialize $TG$ in the following way; given a vector $X_g \\in T_g G$, we know that $\\mathcal L_{g^{-1}*} X_g$ is in $\\mathfrak g$. Thus, define the following isomorphism between fiber bundles:\n\\[ \\begin{align} \\mathcal L:\\ TG \u0026\\rightarrow G \\times \\mathfrak g\\\\ X_g \u0026\\mapsto (g, \\mathcal L_{g^{-1}*} X_g) \\nonumber \\end{align} \\] The continuity of $\\mathcal L_{g^{-1}*}$ makes the above a valid isomorphism of bundles. If $X\\in \\mathfrak X^{\\mathcal L}(G)$ is a left-invariant vector field that is understood as a smooth section of $TG$, then $\\mathcal L$ flattens the section, i.e., each tangent vector $X(g)$ is mapped to the same vector of $\\mathfrak g$. We call $\\mathcal L$ the left-trivialization of the tangent bundle.\nYet another trivialization is the following:\n\\[ \\begin{align} \\mathcal R:\\ TG \u0026\\rightarrow G \\times \\mathfrak g\\\\ X_g \u0026\\mapsto (g, \\mathcal R_{g^{-1}*} X_g) \\nonumber \\end{align} \\] where $\\mathcal R_g (h) = h \\odot g$ is the right multiplication by $g$. We said that left/right multiplications lead to equivalent theories, only differing by a sign change. That is typically true of Lie theory, but in this particular case, we will make nontrivial observations by considering both $\\mathcal L$ and $\\mathcal R$ simultaneously. 
Consider what happens when we compose $\\mathcal R$ with $\\mathcal L^{-1}$:\n\\[ \\begin{align} \\mathcal R \\circ \\mathcal L^{-1}:\\ G \\times \\mathfrak g \u0026\\rightarrow G \\times \\mathfrak g\\\\ (g, \\tilde X) \u0026\\mapsto (g, \\mathcal R_{g^{-1}*} \\mathcal L_{g*} \\tilde X) \\nonumber \\end{align} \\] This is a bundle endomorphism of $G\\times \\mathfrak g$ (a homomorphism from it to itself). By construction, it is placing the non-commutativity of $G$ under scrutiny. For each tuple of the form $(g,\\tilde X )$, $\\mathcal R \\circ \\mathcal L^{-1}$ makes $\\tilde X$ take a \u0026lsquo;round-trip\u0026rsquo; by sending it to $T_gG$ via $\\mathcal L_{g*}$ and back to $\\mathfrak g$ via $\\mathcal R_{g^{-1}*}$. Note that $\\mathcal R_{g^{-1}{\\ast}} \\mathcal L_{g{\\ast}} \\tilde X=\\textrm{Ad}_g\\tilde X$. The departure of $\\textrm{Ad}_g\\tilde X$ from $\\tilde X$ is a measure of the non-commutativity of multiplication by $g$. Not all group elements are equally non-commutative; for instance, $e$ commutes with all the other group elements.\nMore precisely, we test for the differentiability of $\\odot$ in the product topology on $G \\times G$.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThe converse holds if $G$ is simply connected as a manifold (i.e., it has no holes). We say that the Lie group-Lie algebra correspondence is one-to-one in these cases (see the Cartan-Lie theorem ). Note that the $SO(3)$ group is not simply connected; I recommend walking through the proof of this fact in$\\ \\mathrm{Sec.\\ 1.3.4}\\ $of Brian C. Hall\u0026rsquo;s book.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nTypically, $\\mathbf v $ is given to us in the local coordinates of the chart $(U,h)$, as $(h_{\\ast})_p \\mathbf v \\in \\mathbb R^n$. One way to go about computing $(f_{\\ast})_p \\mathbf v$ is to pick any representative curve $\\gamma$ such that $\\frac{d}{dt}\\left[h\\circ \\gamma (t)\\right]\\big\\vert_{t=0} = (h_{\\ast})_p \\mathbf v$. 
Thereafter, we have $(f_{\\ast})_p \\mathbf v = [f \\circ \\gamma]$. Yet another way to do this computation is to pick a chart $(U^\\prime,h^\\prime)$ at $f(p)$ and determine the Jacobian of $h^\\prime \\circ f \\circ h^{-1}$ at $h(p)$.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThe word identified is used here in the sense of \u0026lsquo;made identical to\u0026rsquo;. I remember being amused when I first came across this usage of it; now I love how resolute it sounds.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThis is true for any linear map when $G$ is a vector space; in particular, $L_g$ and $L_{g*}$ are given by the same matrix multiplication operation in matrix Lie groups. See Prop. 3.13 of Lee\u0026rsquo;s book (Second Edition).\u0026#160;\u0026#x21a9;\u0026#xfe0e;\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nLet $\\gamma(a) \\coloneqq S^a$ be a smooth curve on $G$ such that $\\gamma (0) = S^0$. Then $\\frac{d}{da}\\left[S^a f(t)\\right]\\big\\vert_{a=0} = \\lim_{\\Delta a \\rightarrow 0}\\frac{f(t+\\Delta a) - f(t)}{\\Delta a} = \\frac{df}{dt}(t)$. This is for a unit tangent vector; a scaled tangent vector is obtained by considering a curve of the form $\\gamma(a) \\coloneqq S^{\\tau a}$ instead.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/lie-groups/","summary":"A \u003cspan class=accented\u003etopological group\u003c/span\u003e is a set of elements that has both a group operation and a topology. The group operation satisfies the usual axioms (same as those of finite groups), and the presence of a topology lets us say things like \u0026rsquo;the group is connected\u0026rsquo; and \u0026rsquo;the group operation is continuous'.","title":"The Lie Group-Lie Algebra Correspondence"},{"content":"Over the past year, I have struggled to pin down what the scope of my blog should be. 
There is plenty of exposition out there on just about every aspect of modern mathematics, but especially on exterior calculus and differential geometry due to their situation at the intersection of several areas in theoretical and applied mathematics. So then what is the scope of my blog? Maybe it is for me to catalog the process of self-learning mathematics as an engineering major who lacks a curricular background in modern mathematics. Maybe it is to assure others like me (who are also privileged enough to learn mathematics in isolation of such material concerns as its \u0026lsquo;job prospects\u0026rsquo;) that it can be done. This post will do a bit of both; it serves in part the purpose of organizing my own thoughts on these matters, and in part the purpose of providing a roadmap for others who are interested in embarking on a similar journey.\nTo appreciate the aspects of vector fields that I will touch upon requires at least a familiarity with the abstract/modern definition of a vector space; for an introduction to the topic that does not rely too heavily on abstract constructions, see Calculus on Manifolds by Spivak. An even more pressing prerequisite is the notion of a smooth (i.e., differentiable) manifold. Unless you are already familiar with manifolds, or choose to push forth so daringly with the intention of learning about manifolds later on, I delegate your introduction of manifolds to The Bright Side of Mathematics , one of my favorite math channels on YouTube. Aspects of topology and differentiability will not show up until the latter half of this discussion.\nPreliminaries The Coordinate-Free Approach The undergrad course on linear algebra at IIT Madras (my alma mater) relies heavily on coordinates. Or maybe it did not emphasize enough the distinction between vectors and their representations via coordinates , so that I was not privy to the underlying structure of vector spaces until very recently. 
My fond dislike for coordinate-based linear algebra probably stems from this, though I think there\u0026rsquo;s a case to be made for speaking as many different languages as one can. This allows one to see the same mathematical object from different perspectives, not unlike how one walks around a sculpture at an art exhibit, appreciating not only the finer details of the craftsmanship but also the overall intent.\nThe coordinate-based approach to linear algebra begins by choosing (often implicitly) a basis called the standard basis for each vector space that is encountered. It is once we have fixed the choice of the standard basis that we forget (typically for notational convenience) that a basis was ever in need of being chosen. We instead resort to dealing solely with the coordinate representations of the vectors. This is not a bad thing by any means, it provides a two-pronged (algebraic $+$ numerical) framework for analyzing linear transformations by way of matrix multiplication , something that underlies a lot of the modern-day engineering marvels. The case for the coordinate-free approach is then that it reveals the underlying structure of the vector space by breaking free of the shackles of a fixed basis. Indeed, some of the most influential ideas in classical and quantum physics were arrived at by recognizing that no frame of reference is more canonical than the other. The fact that the laws of physics don\u0026rsquo;t change when we switch from one frame of reference to another leads to the famous conservation laws of physics; see Noether\u0026rsquo;s theorem. 
By requiring that the speed of light remain unchanged while switching frames of reference, Einstein was able to uncover key insights that contributed to his theory of special relativity.1 In a similar vein, by requiring that a vector remain unchanged under a change of basis, we arrive at a rich algebraic and geometric structure that is ripe for mathematical exploration.\nCovariance and Contravariance As Einstein did with the speed of light in his development of special relativity, let us recognize that a vector (not to be confused with the coordinate representation of a vector) will remain unchanged under a change of basis. Letting $\\mathbf v = v^1\\mathbf e_1 + v^2\\mathbf e_2 + \\dots + v^n\\mathbf e_n$ be a vector in an $n$-dimensional real vector space $V$, where $\\mathbf e_1$ through $\\mathbf e_n$ are a basis for $V$, we can introduce the following notation:\n\\[ \\begin{align} \\mathbf v = \\begin{bmatrix} v^1 \u0026 v^2 \u0026 \\dots \u0026 v^n\\end{bmatrix} \\begin{bmatrix} \\mathbf e_1 \\\\ \\mathbf e_2 \\\\ \\vdots \\\\ \\mathbf e_n \\end{bmatrix} \\end{align} \\] Given an invertible matrix $\\mathbf A$, we see that\n\\[ \\begin{align} \\mathbf v \u0026= \\begin{bmatrix} v^1 \u0026 v^2 \u0026 \\dots \u0026 v^n\\end{bmatrix} \\mathbf A^{-1} \\mathbf A \\begin{bmatrix} \\mathbf e_1 \\\\ \\mathbf e_2 \\\\ \\vdots \\\\ \\mathbf e_n \\end{bmatrix}\\\\ \u0026= \\left(\\mathbf A^{-\\top}\\begin{bmatrix} v^1 \\\\ v^2 \\\\ \\vdots \\\\ v^n\\end{bmatrix}\\right)^\\top \\mathbf A \\begin{bmatrix} \\mathbf e_1 \\\\ \\mathbf e_2 \\\\ \\vdots \\\\ \\mathbf e_n \\end{bmatrix} \\end{align} \\] This hints at a recurring idea in vector and tensor algebra, that the components (or coordinates) of a vector transform in an opposite way to that of the basis vectors. In the language of tensor algebra, we say that the basis vectors transform in a covariant manner, whereas the vector\u0026rsquo;s components transform in a contravariant manner. 
Here, co- and contra- are supposed to work as antonyms, although beware that the prefix co- is used rather inconsistently2 in the larger context of mathematics and English. As the basis vectors and the components of $\\mathbf v$ both transform in opposite ways, the vector $\\mathbf v$ remains unchanged.3\nThe components of a vector are said to constitute a type $(1,0)$ tensor. In general, the transformation of a type $(k,l)$ tensor is given by a mixture of the above two rules; it has $k$ contravariant and $l$ covariant components. This is what is meant when one says that \"a vector is a contravariant tensor of rank 1.\" This is analogous to what would happen if you were measuring the length of a rubber band with a ruler made of rubber. Suppose you measured its length as $1\\textrm{cm}$ on Tuesday and $2\\textrm{cm}$ on Thursday, what can be deduced about the state of affairs on Thursday? Well, maybe somebody stretched the rubber band to twice its size and held it there, causing the length measurement to increase. Alternatively, they might have painstakingly compressed the ruler to half its length, achieving the same (perceived) increase in the length measurement. These opposing viewpoints of the same phenomenon are ones that we should become comfortable walking back and forth between. It is a manifestation of the idea of duality that shows up quite often in mathematics. While the notion of duality is fundamental enough that it persists even across a coordinate-heavy treatment of vector spaces, it is brought to the forefront in the modern approach to linear algebra.\nDuality As soon as we define a vector space $V$, we receive \u0026lsquo;for free\u0026rsquo; another vector space that is associated with it, denoted as $V^{\\ast}$, which is called the dual space of $V$. Each element of $V^{\\ast}$ is a linear function that maps the vectors of $V$ to its underlying field, which in our case is the real numbers. 
Thus, if $w:V \\rightarrow \\mathbb R$ and $w\\in V^{\\ast}$, this means that (by the linearity of $w$)\n\\[ \\begin{align} w(\\mathbf v + \\mathbf w) \u0026= w(\\mathbf v) + w(\\mathbf w) \\\\ w(t \\mathbf v) \u0026= t w(\\mathbf v) \\end{align} \\] where $t\\in \\mathbb R$ and $\\mathbf v, \\mathbf w \\in V$. Given $v, w \\in V^{\\ast}$, their addition is defined by the rule $(v+w)(\\cdot) = v(\\cdot) + w(\\cdot)$, and scalar multiplication is defined similarly. Thus, $V^{\\ast}$ is indeed a vector space. The object $w\\in V^{\\ast}$ is called a dual vector, a covector, a linear functional, a linear form, or a one-form . In contrast, the object $\\mathbf v \\in V$ is simply called a vector.\nYou will see a lot of accounts using 'one-form' to mean a differential one-form, but we are not quite there yet. I will distinguish one-forms from differential one-forms. Suppose that the vector space $V$ is endowed with an inner product, $\\langle \\cdot, \\cdot \\rangle:V\\times V \\rightarrow \\mathbb R$. The inner product is by definition (and therefore, by construction) linear in each of its arguments. So, given a vector $\\mathbf v\\in V$, the mapping $\\langle \\mathbf v, \\cdot \\rangle:V \\rightarrow \\mathbb R$ passes all the requirements for its membership in $V^{\\ast}$. Let\u0026rsquo;s define $\\mathbf v^\\flat (\\cdot)=\\langle \\mathbf v, \\cdot \\rangle$ such that $\\mathbf v^\\flat\\in V^*$. The inner product thus induces a mapping $\\mathbf v \\mapsto \\mathbf v^\\flat$ that associates each vector $\\mathbf v$ with a corresponding covector $\\mathbf v^\\flat\\in V^{\\ast}$. 
Conversely, we can associate to each covector $f$ a vector $f^\\sharp \\in V$, which is the vector that satisfies $\\langle f^\\sharp, \\mathbf v\\rangle = f(\\mathbf v)$.\nOn account of this pairing between vectors and covectors, we say that $V$ and $V^{\\ast}$ are isomorphic (i.e., have the same underlying structure) as vector spaces, and use \u0026lsquo;$\\sharp$\u0026rsquo; and \u0026lsquo;$\\flat$\u0026rsquo; (which are called the musical isomorphisms ) to \u0026rsquo;translate between the languages\u0026rsquo; of the two vector spaces. In high school and undergraduate linear algebra, we might have written $\\mathbf v^\\top$ to refer to the dual version of (i.e., the covector that is paired to) $\\mathbf v$; it helps that \u0026lsquo;covector\u0026rsquo; rhymes with \u0026lsquo;row-vector\u0026rsquo;. The musical isomorphisms generalize the notion of a transpose by permitting it to operate differently in either direction. The musical meaning of \u0026lsquo;$\\sharp$\u0026rsquo; and \u0026lsquo;$\\flat$\u0026rsquo; (raising/lowering a note) is still relevant; it corresponds to the raising/lowering of indices. Take a look again at $(1)$, and see that I have used the choice of index (superscript vs. subscript) to distinguish the contravariant and covariant pieces.\nTopology All of the above was introduced in the absence of a smooth manifold, which suggests that manifolds are yet to contribute something to the discussion. 
Before we proceed, I\u0026rsquo;d like to list some concepts that I think you should at least skim the Wikipedia pages of (and/or YouTube videos on ) before proceeding further:\nA topological space as a set combined with a collection of open sets The so-called standard topology on $\\mathbb R^n$, defined in terms of open balls/neighborhoods Charts and smooth structures on topological spaces; a cursory understanding of the jargon should be sufficient for our purpose (which is to have fun) Homeomorphisms (and diffeomorphisms), which are continuous (differentiable) maps that have continuous (differentiable) inverses These topics are also covered in the first chapter of Introduction to Smooth Manifolds by John M. Lee.\nAn often overlooked requirement for studying differential geometry is the ability to conceptualize the composition of functions. Given $f_1:A \\rightarrow B$ and $f_2:B \\rightarrow C$, their composition $f_2 \\circ f_1:A \\rightarrow C$ can be read out aloud as \u0026lsquo;$f_2$ after $f_1$\u0026rsquo;. It is the composition of $f_1$ with $f_2$. The order in which $f_1$ and $f_2$ appear in the preceding statements is a source of confusion that can be eliminated by drawing some arrow diagrams. Alternatively, remember that $f_2 \\circ f_1 (\\cdot) = f_2\\left(f_1(\\cdot)\\right)$, so the functions must be ordered such that the image of $f_1$ is contained in the domain of $f_2$. Finally, $C^1(\\mathcal D)$ is the set of all functions that are defined on the domain $\\mathcal D$ whose first derivative is continuous (i.e., are differentiable). A function is said to be smooth if it\u0026rsquo;s in $C^\\infty (\\mathcal D)$, though the distinction between $C^1$ and $C^\\infty$ is not entirely relevant given the rigor of this article.\nTangent \u0026amp; Cotangent Spaces of $\\mathcal M$ Any $n$-dimensional vector space is isomorphic to $\\mathbb R^n$, which is to say that it can only differ superficially from $\\mathbb R^n$. 
After choosing bases for either space, we are reduced to working with coordinates, and the distinction between the two spaces is lost.\nThis begs the question of what happens in $n$-dimensional spaces that differ from $\\mathbb R^n$ in some meaningful way. Of particular interest to us (and to physicists and roboticists) are spaces that are curved. Einstein was interested in the geometry of a particular curved four-dimensional space (composed of one temporal and three spatial dimensions).4 A prototypical example to keep in mind is the surface of a ball in $\\mathbb R^3$, which is a $2$-dimensional manifold; each point on its surface has a neighborhood that appears to be a $2$-dimensional plane (hence arises the confusion of flat-earthers).\nThe Tangent Space at $p$ An $n$-dimensional manifold has the property that it locally (but not necessarily globally) resembles (i.e., is homeomorphic to) the $n$-dimensional Euclidean space. We will denote throughout an arbitrary manifold by $\\mathcal M$. At each point $p\\in \\mathcal M$, there is (at least one) chart $(U, h)$ such that $U\\subseteq \\mathcal M$ is an open set containing $p$ and $h:U\\rightarrow U^\\prime$ maps the points in $U$ to those in an open set $U^\\prime \\subseteq \\mathbb R^n$. We say that the manifold is differentiable if the given collection of charts constitutes a smooth atlas; a smooth atlas is a collection of charts that are compatible with each other in a way that lets us (unambiguously) walk back and forth between $\\mathcal M$ and its local descriptions in $\\mathbb R^n$ for all our differentiation purposes.\nThe points on $\\mathcal M$ do not constitute a vector space, as we don\u0026rsquo;t necessarily know what it means to \u0026lsquo;add two points on a manifold\u0026rsquo;, and vector spaces must (amongst other things) have some notion of addition. 
Given a chart $(U, h)$, $h(U)\\subseteq U^\\prime$ is not a vector space either (as it may not be closed under addition; a vector space extends indefinitely in all directions due to scalar multiplication). It will take some more effort to define a vector space at each point $p$ of $\\mathcal M$, which we will call the tangent space of $\\mathcal M$ at $p$, and denoted $T_p \\mathcal M$. A naive (and potentially misleading) picture of the tangent space is that of a plane touching $\\mathcal M$ at $p$. A more rigorous picture is given by the following diagram:\nHere, $\\gamma:[-1,1] \\rightarrow \\mathcal M$ is a differentiable map, in the sense that $h\\circ \\gamma:[-1,1]\\rightarrow \\mathbb R^n$ is differentiable. As we vary $t\\in[-1,1]$, $\\gamma(t)$ traces out a path (or a curve5) on $\\mathcal M$. $f:\\mathcal M \\rightarrow \\mathbb R$ is an arbitrary differentiable function on $\\mathcal M$ (written as $f \\in C^1(\\mathcal M)$) whose arbitrariness will play a key role in our construction of tangent vectors.\nLet $\\gamma(0)=p$ and $\\gamma(\\epsilon)=p^\\prime$ for some $0\u003c\\epsilon \\approx 0$. The idea of \u0026lsquo;differentiation on manifolds\u0026rsquo; is to capture the change in the value of $f$ as we move from $p$ to $p^\\prime$. The choice of the map $\\gamma$ (subject to the constraint that it is differentiable and that $\\gamma(0)=p$) decides how fast, and in what direction $\\gamma(t)$ crosses $p$, leading to the notion of a directional derivative at $p$. If we define a new path $\\tilde \\gamma(t)\\coloneqq\\gamma(2t)$ for $t\\in[-0.5, 0.5]$, i.e., we speed up the evolution of time by a factor of $2$, then $\\tilde \\gamma$ will zip past $p$ twice as fast as $\\gamma$. Indeed, since $$\\tilde \\gamma(\\epsilon/2) = \\gamma(\\epsilon) =p^\\prime,$$ we see that $\\tilde \\gamma$ took only \u0026lsquo;half as much time\u0026rsquo; as $\\gamma$ to get from $p$ to $p^\\prime$. 
Thus, by allowing time to elapse faster, slower, or backward, we have something resembling the scalar multiplication of vectors.\nHowever, the paths (or functions) $\\gamma$ themselves are not the tangent vectors. This is because the tangent space at $p$ is not concerned with how the paths behave away from $p$, rather, it looks at how the infinitesimal displacements near $p$ change the value of an arbitrary smooth function $f$. A suitable definition of a tangent vector is as the equivalence class of (i.e., the set of all) paths that zip past $p$ with the same direction and speed; each of these paths corresponds to the same tangent vector as the paths are indistinguishable near $p$. Similarly, our usual notion of the tangent of a function (at some point) has an alternative interpretation: it represents the set of all functions whose first-order Taylor series approximations (at the given point) are identical:\nGiven a tangent vector $\\mathbf v\\in T_p \\mathcal M$, we can pick a representative path $\\gamma$ from its equivalence class to give a tangible meaning to $\\mathbf v$. In particular, given any real-valued smooth function $f\\in C^1(\\mathcal M)$, the directional derivative corresponding to $\\mathbf v$ is denoted by\n\\[ \\begin{align} \\mathbf v(f) = \\frac{d}{dt}[f\\circ\\gamma](t)\\Big\\vert_{t=0} \\end{align} \\] By construction, $\\mathbf v(f)$ does not depend on the particular choice of $\\gamma$ from the equivalence class of $\\mathbf v$. Observe also, that $\\mathbf v(f)\\in \\mathbb R$, which suggests that the function $f$ plays a role similar to that of a covector/dual vector.\nThe Cotangent Space at $p$ The fact that $\\mathbf v(f)\\in \\mathbb R$ is to be compared to our earlier observations about covectors. A covector is an object that reduces vectors to real numbers. 
More formally, it is a linear function whose domain is the entire vector space (in this case, $T_p\\mathcal M$) and whose codomain is the underlying field of scalars (in this case, $\\mathbb R$). So is a function $f\\in C^1(\\mathcal M)$ a covector? Not quite.\nWe could not simply identify the tangent vectors with the paths/curves on $\\mathcal M$ passing through $p$ because it would lead to inadvertent double-counting; two paths that are indistinguishable near $p$ would get counted as two distinct tangent vectors, which is not the case. A similar problem arises if we try to identify the functions $f\\in C^1(\\mathcal M)$ with covectors. Two functions that behave similarly near $p$ would get counted as different cotangent vectors, even though they extract the same measurement out of a given vector.\nNote that we tend to say that covectors measure vectors, which is in direct opposition to the observation that directional derivatives (represented by vectors) operate on functions (represented by covectors). It would seem more natural then to stipulate that 'vectors measure covectors'. Nonetheless, it's a good idea to move past this urge to make sense of the phrasing and instead focus on the underlying mathematical structure. So, we do the same thing as before. Denote by $df$ a cotangent vector, covector, or one-form at $p$, defined as the equivalence class of all the functions that have the same directional derivatives at $p$. In other words, functions $f_1,f_2\\in C^1(\\mathcal M)$ are associated to the same covector $df$ as long as $\\mathbf v(f_1) = \\mathbf v(f_2)$ for all $\\mathbf v \\in T_p \\mathcal M$. Since $df$ is itself something that operates on vectors, we define $df$ such that its operation on the vectors is given by $df(\\mathbf v) = \\mathbf v(f)$, where the $f$ can be any representative from the equivalence class of $df$. 
The set of all covectors at $p$ makes up the cotangent space of $\\mathcal M$ at $p$, which is denoted as $T^{\\ast}_p\\mathcal M$ on account of its duality to $T_p\\mathcal M$.\nFor example, if $\\mathcal M=\\mathbb R$, then $T_p \\mathbb R$ is a one-dimensional vector space that encapsulates how fast we are moving past $p\\in \\mathbb R$, and whether our movement is to the left or to the right; the directions left and right are related by a change of sign. In this case, $T^{\\ast}_p \\mathbb R$ can be identified with the set of all slopes that a first-order Taylor series approximation of some function (at $p\\in\\mathbb R$) could have. Its dimension is also one, as the only degree of freedom here is the slope of the line. Observe that $$ T_p \\mathbb R \\cong T^*_p \\mathbb R \\cong (-\\infty, \\infty) = \\mathbb R. $$ The first isomorphism holds for all (finite-dimensional) tangent spaces, but the second isomorphism generally only holds for Euclidean spaces. It is certainly NOT the case that $T_p \\mathcal M \\cong \\mathcal M$ in general. Before moving on from the above example, reflect upon what a path $\\gamma :[-1,1]\\rightarrow \\mathbb R$ on the manifold $\\mathbb R$ might look like.\nPartial Derivatives and Differentials In the diagram above (which must be consulted repeatedly during the forthcoming discussion), we depict the standard basis on $\\mathbb R^n$ by the black rays. Since $h$ is a homeomorphism, we can use $h^{-1}$ to lift the black rays onto $\\mathcal M$. These give us \u0026lsquo;coordinate paths\u0026rsquo; on $\\mathcal M$ whose equivalence classes (at each $p\\in U$) may be taken as a basis of $T_p \\mathcal M$. The basis thus obtained at $p$ is commonly denoted as $\\big(\\frac{\\partial}{\\partial x^i}\\big\\rvert_p\\big)_{i=1}^n$, whereas $\\big(\\frac{\\partial}{\\partial x^i}\\big)_{i=1}^n$ defines a set of $n$ vector fields on $U$, called a local coordinate frame. 
The discussion of frames is deferred to a later post .\nAn arbitrary vector $\\mathbf v \\in T_p \\mathcal M$ can therefore be expressed as\n\\[ \\begin{align} \\mathbf v= v^1 \\frac{\\partial}{\\partial x^1}\\Big\\rvert_p + v^2 \\frac{\\partial}{\\partial x^2}\\Big\\rvert_p + \\dots + v^n \\frac{\\partial}{\\partial x^n}\\Big\\rvert_p, \\end{align} \\] with $v^i\\in \\mathbb R$.\nRemark: Ideally, we should reserve the notation \"$\\partial/\\partial x^i \\rvert_{h(p)}$\" for the standard basis vectors of $T_{h(p)}\\mathbb R^n$, with $h(p) \\in \\mathbb R^n$, and use different notation for the tangent vectors of $T_p\\mathcal M$. The (abuse of) notation I've used follows from that of Lovelock and Rund (p. 334). The distinction between the two is clear once we consider what a directional derivative on a manifold $\\mathcal M$ should do: it's a mapping from $C^1(\\mathcal M)$ to $\\mathbb R$, i.e., it computes the derivative of a $C^1(\\mathcal M)$ function at a given point on $\\mathcal M$, along a prescribed direction. The object ${\\partial/\\partial x^i \\rvert_p}$ is related to \"$ \\partial/\\partial x^i \\rvert_{h(p)}$\" by the pushforward map corresponding to $h^{-1}$. The basis $\\left\\lbrace \\frac{\\partial}{\\partial x^1}\\big\\rvert_p, \\dots, \\frac{\\partial}{\\partial x^n }\\big\\rvert_p\\right\\rbrace$ of $T_p \\mathcal M$ gives rise to a corresponding basis of $T^*_p \\mathcal M$, called the dual basis. The dual basis is denoted as $\\left\\lbrace dx^1_p, \\dots, dx^n_p\\right\\rbrace$ and defined such that the following identity holds:\n\\[ \\begin{align} dx^i_p\\left(\\frac{\\partial}{\\partial x^j}\\Big\\rvert_p\\right) = \\delta_{ij}, \\end{align} \\] where $\\delta_{ij}$ is the Kronecker delta . A covector $df_p \\in T^{\\ast}_p \\mathcal M$ can therefore be represented as\n\\[ \\begin{align} df_p = f_1 dx^1_p + f_2 dx^2_p + \\dots + f_n dx^n_p. 
\\end{align} \\] The notation that was introduced above starts to have further meaning once we introduce the so-called coordinate functions. Decompose the chart $h$ into its constituent coordinate functions as follows:\n\\[ \\begin{align} h(\\cdot)=\\begin{bmatrix} x^1(\\cdot)\\\\ x^2(\\cdot)\\\\ \\vdots \\\\ x^n(\\cdot) \\end{bmatrix} \\end{align} \\] such that $x^1(p^\\prime), \\dots, x^n(p^\\prime)$ are the coordinates of a point $p^\\prime \\in U$ with respect to the chart $(U, h)$ that contains $p$. Thus, $x^i:U \\rightarrow \\mathbb R$. It appears as if we have defined the basis covector $dx^i_p$ such that\n\\[ \\begin{align} dx^i_p\\Big(\\frac{\\partial}{\\partial x^j}\\Big\\rvert_p\\Big) = \\frac{\\partial}{\\partial x^j}\\Big\\rvert_p\\big(x^i\\small\\big), \\end{align} \\] where the $x^i$ on the right-hand side refers to a function. Thus, $x^i$ is a representative function from the equivalence class of $dx^i_p$ (in fact, this statement holds for all $p\\in U$).\nThe Euclidean Case When $\\mathcal M = \\mathbb R^n$ and we have chosen a basis, there is a canonical inner product on $T_p\\mathbb R^n$ that has a diagonal (identity matrix) structure,\n\\[ \\begin{align} \\langle \\frac{\\partial}{\\partial x^i}\\Big\\rvert_p, \\frac{\\partial}{\\partial x^j}\\Big\\rvert_p\\rangle = \\delta_{ij}, \\end{align} \\] which is colloquially called the standard inner product for $\\mathbb R^n$. Since $dx^i_p\\big(\\frac{\\partial}{\\partial x^j}\\Big\\rvert_p\\big) = \\delta_{ij}$ also, we have in this case,\n\\[dx^i_p = \\frac{\\partial}{\\partial x^i}\\Big\\vert_p^{\\flat}, \\quad {dx^i_p}^\\sharp = \\frac{\\partial}{\\partial x^i}\\Big\\vert_p. 
\\] We also have the following convenient fact that follows from the bilinearity of the inner product: \\[ \\begin{align} \\langle \\frac{\\partial}{\\partial x^i}\\Big\\rvert_p, \\mathbf v\\rangle = \\sum_{j=1}^{n}v^j \\langle \\frac{\\partial}{\\partial x^i}\\Big\\rvert_p, \\frac{\\partial}{\\partial x^j}\\Big\\rvert_p\\rangle = v^i \\end{align} \\] This is analogous to how a row vector such as $\\begin{bmatrix}0 \u0026 \\cdots\u0026 0 \u0026 1 \u0026 0 \u0026 \\cdots \\end{bmatrix}$ picks out the corresponding component of a vector that it operates on. Similarly (by the linearity of \u0026lsquo;$\\sharp$\u0026rsquo;),\n\\[ \\begin{align} \\langle \\frac{\\partial}{\\partial x^i}\\Big\\rvert_p, df^\\sharp_p\\rangle = f_i. \\end{align} \\] Using simple manipulations, we can offer alternative interpretations to the previous equation:\n\\[ \\begin{align} \\langle \\frac{\\partial}{\\partial x^i}\\Big\\rvert_p, df^\\sharp_p\\rangle \u0026= \\langle df^\\sharp_p, \\frac{\\partial}{\\partial x^i}\\Big\\rvert_p\\rangle \\\\ \u0026= ({df_p^\\sharp})^\\flat \\left(\\frac{\\partial}{\\partial x^i}\\Big\\rvert_p\\right)\\\\ \u0026= {df}_p \\left(\\frac{\\partial}{\\partial x^i}\\Big\\rvert_p\\right)\\\\ \u0026= \\frac{\\partial}{\\partial x^i}\\Big\\rvert_p(f) = f_i. \\end{align} \\] Now behold, for something striking will occur when we combine (9) and (18):\n\\[ \\begin{align} df_p \u0026= \\sum_{i=1}^n f_i dx^i_p \\\\ \u0026= \\sum_{i=1}^n \\frac{\\partial f}{\\partial x^i}\\Big\\rvert_p dx^i_p \\end{align} \\] This is nothing but the multivariable chain rule, which relates the differential of a function to infinitesimal changes in its arguments.\nIt should be noted that the measurement of a vector by a covector can be defined in the absence of an inner product; even the dual basis of covectors is defined uniquely. However, we do need an inner product to be able to make sense of the musical isomorphisms. 
This is simply because the musical isomorphisms, by definition, refer to the pairing between vectors and covectors induced by the inner product.\nThe Non-Euclidean Case While it\u0026rsquo;s always possible to choose a chart $(U, h)$ containing $p$ such that the basis $\\big(\\frac{\\partial}{\\partial x^i}\\big\\rvert_p\\big)_{i=1}^n$ is orthonormal, it is seldom possible to choose an \u0026lsquo;orthonormal coordinate system\u0026rsquo; whose coordinate vector fields are orthonormal at all points of $U$. In fact, the existence of such a chart would imply that the manifold is of a very special type (at least on $U$): it is flat. Cylinders, cones, and other objects that can be wrapped with a sheet of paper without tearing, wrinkling, or folding the paper are examples of flat manifolds. I touch upon this point in a later post. Such considerations will only arise when we try to extend the basis of $T_p \\mathcal M$ into a frame, so the reader need not dwell on this point too much right now, and may skip comfortably to the next section.\nA Riemannian metric $g_{(\\cdot)}$ is an object that assigns to each point $p\\in \\mathcal M$ an inner product on $T_p \\mathcal M$, written as $g_{p}(\\cdot, \\cdot)$ or $\\langle \\cdot,\\cdot\\rangle_p$ depending on the author. What we have done above is to stipulate that $g_{p}(\\frac{\\partial}{\\partial x^i}, \\frac{\\partial}{\\partial x^j})=\\delta_{ij}$. More generally, one introduces a convenient coordinate system on $U\\subseteq \\mathcal M$ (for instance, the spherical polar coordinates of a sphere) and then specifies (or when there is some natural choice of metric, computes) the \u0026lsquo;components of the metric tensor\u0026rsquo;, collected into a matrix of numbers, $g_{ij}(p)$. Each of these components is required to be a smooth function of $p$ \u0026ndash; a requirement that marries the differential structure of $\\mathcal M$ to its geometric structure. 
In the general case, we write\n\\[ \\big\\langle\\frac{\\partial}{\\partial x^i}, \\frac{\\partial}{\\partial x^j}\\big\\rangle _p=g_{ij}(p) \\] When stored as the entries of a matrix, $g_{ij}(p)$ are related to the Jacobian matrix of the parameterization evaluated at $p$, and the corresponding determinant tells us how much the parameterization stretches or squishes the space near $p$. Letting $g^{ij}(p)$ denote the entries of the inverse matrix, the musical isomorphisms are computed as follows:\n\\[ \\begin{align} \\frac{\\partial}{\\partial x^i}\\Big\\rvert_p^\\flat = g_{ij}(p)\\,dx^j_p, \\quad {dx^i_p}^\\sharp = g^{ij}(p)\\frac{\\partial}{\\partial x^j}\\Big\\rvert_p, \\end{align} \\] These operations are sometimes referred to as lowering and raising of index (respectively) for obvious reasons. There are other ways to compute $g_{ij}(p)$; for instance, when the manifold is not parameterized but given by an implicit equation (such as \u0026lsquo;$x^2 + y^2 + z^2=1$\u0026rsquo;), it is not clear which Jacobian must be evaluated. See the books by G. S. Chirikjian for more details about these computations.\nIt is also possible to define a non-standard Riemannian metric for $\\mathbb R^n$ which is nonetheless \u0026ldquo;Euclidean\u0026rdquo;, i.e., represents a \u0026ldquo;flat\u0026rdquo; space. This is the case when the coefficients $g_{ij}(p)$ are each constant as functions of $p$, for instance.\nTangent/Cotangent Bundles A fiber bundle is obtained when at each point on a space, a fiber is attached. This is sort of like how hair grows all over your (or at least my South Indian) skin, in which case one of the strands of hair (serving as a representative example of the other strands) gets called the fiber, and the surface of the skin is called the base space. More rigorously, we refer to the following picture from Wikipedia of a fiber bundle:\nHere, the base space $M$ is a set of points. 
The squiggly rectangle $E$ is called the total space, and the fiber is a vertical line (not shown). By affixing a copy of the fiber (the vertical line) at each point on $M$ ($E_{m_1}$ on $m_1$, $E_{m_2}$ on $m_2$, and so on) in a particular way, we obtain the total space $E$. Here, $\\pi:E\\rightarrow M$ is called the projection, which takes a point on the total space to the point on the base space to which a fiber was attached; $s:M\\rightarrow E$ is called a section (or more intuitively, a cross-section). The section is a smooth function that chooses at each point of $M$ (e.g., $m_1, m_2, \\dots$) a single point on its fiber (e.g., $s(m_1) \\in E_{m_1}$, $s(m_2)\\in E_{m_2}$, and so on).\nActually, there is some subtlety involved in the definition of a fiber bundle which I have brutalized in the hasty discussion above. Since it does not matter too much for our present purposes, I will let you learn the rigorous definition of a fiber bundle in your own time. A similar characterization allows us to introduce the notion of the tangent bundle of $\\mathcal M$, denoted as $T\\mathcal M$. The base space is $\\mathcal M$, and the object attached at point $p\\in \\mathcal M$ is the tangent space $T_p \\mathcal M$. The projection $\\pi:T\\mathcal M \\rightarrow \\mathcal M$ is defined as $\\pi(\\mathbf v) = p$ where $\\mathbf v \\in T_p \\mathcal M$. Similarly, we can define the cotangent bundle of $\\mathcal M$, and denote it as $T^{\\ast}\\mathcal M$.\nA section of $T\\mathcal M$ is a map that assigns to each point $p\\in \\mathcal M$ a tangent vector $\\mathbf v \\in T_p \\mathcal M$ (refer to the diagram above). A (smooth) section of the tangent bundle is called a (smooth) vector field. A (smooth) section of the cotangent bundle is called a differential one-form. 
A more intuitive name for a differential one-form is a \u0026lsquo;smooth one-form field\u0026rsquo;, but the less intuitive name prevails in the literature, so we might as well acquaint ourselves with it.\nNow to unpack the newly introduced objects. A smooth vector field is a function $X:\\mathcal M \\rightarrow T\\mathcal M$ such that $X(p)\\in T_p\\mathcal M$. One can imagine $X$ as representing the motion of water on $\\mathcal M$. After dropping a paper boat at some point on $\\mathcal M$, we may sit back and observe where the motion of the water takes it. Once the boat traces out a path on $\\mathcal M$, the resulting path is called an integral curve or (rather suitably) a flow on $\\mathcal M$ generated by $X$.6 The set of smooth vector fields is itself a vector space, and is denoted as $\\mathfrak X(\\mathcal M)$. I\u0026rsquo;ll let you check that the addition and scalar multiplication operations can be defined suitably. It is not so obvious what a basis for $\\mathfrak X(\\mathcal M)$ should be; it\u0026rsquo;s infinite-dimensional as a vector space (think about how one might show this!).\nA differential one-form $\\alpha:\\mathcal M \\rightarrow T^{\\ast}\\mathcal M$ maps each point to a covector (i.e., a measuring stick) on that fiber. This is like placing a \u0026lsquo;speed sensor\u0026rsquo; at every point of the manifold on which water is flowing. The orientation and calibration parameters of the sensors vary smoothly as one moves about the manifold. Note that if $p^\\prime \\neq p$, then a covector $\\alpha(p^\\prime) \\in T^{\\ast}_{p^\\prime} M$ cannot measure a vector $X(p)\\in T_{p} M$, as there is no relationship between these vector spaces in general. In other words, each of the speed sensors has nothing to say about the flow of water at some distance away from it.\nWith some reflection, it should be obvious how we should define the operation of a differential one-form on a vector field. 
At each point $p\\in \\mathcal M$, the covector $\\alpha(p)$ measures the vector $X(p)$ to produce a real number. The resulting function \u0026lsquo;$\\alpha(\\cdot)\\left(X(\\cdot)\\right)$\u0026rsquo; is called a differential $0$-form, which is just another name for a (smooth) scalar-valued function on $\\mathcal M$.\nPushforwards and Pullbacks The covariance and contravariance of vectors and covectors can be related to the notion of a dual category in category theory. The dual or opposite category is obtained by reversing all of the arrows of the original category. It is the category-theoretic notion of duality, whereas the vector space version that we observed above is a special case. The reversal of arrows explains $(1)$-$(3)$. It also explains how pushforwards and pullbacks should work.\nWhen we have a smooth map $\\varphi:\\mathcal M \\rightarrow \\mathcal N$ between two manifolds, we can define the pushforward of a vector field $X\\in \\mathfrak X(\\mathcal M)$ as the vector field $\\varphi_* X \\in \\mathfrak X (\\mathcal N)$ that is induced by $\\varphi$. (Individual tangent vectors can always be pushed forward, but for $\\varphi_* X$ to be a well-defined vector field on all of $\\mathcal N$, we generally need $\\varphi$ to be a diffeomorphism.) The idea is that the domain of $X$ is pushed forward by $\\varphi_*$. I\u0026rsquo;ll let you figure out how the pushforward vector field should be defined; when in doubt, use the \u0026lsquo;equivalence classes of paths\u0026rsquo; identification of the tangent vectors. The duality of vector fields and covector fields suggests that differential one-forms (which are covector fields) can be pulled back. Indeed, a differential one-form $\\alpha$ on $\\mathcal N$ can be pulled back by $\\varphi$ to yield a differential one-form $\\varphi^{\\ast}\\alpha$ on $\\mathcal M$, which is called the pullback of $\\alpha$ under $\\varphi$. 
Again, I leave you to deduce how the pullback should be defined (Hint: Draw diagrams as I did, and try composing the functions available at hand).\nAdditional Structure Our discussion brings us to a juncture that forks into two different topics in math: parallel transport and exterior calculus (this distinction is neither sharp nor exhaustive). I don\u0026rsquo;t have anything specific to say about these areas that\u0026rsquo;s particularly insightful or conveys the ideas any better than has been done already, so I will instead leave you with some resources. The two topics can be approached in either order.\nParallel Transport We observed that tangent spaces at different points on $\\mathcal M$ have (in general) no relationship to each other. This gives us no way to distinguish, for instance, a curved and swirly vector field from a straight, streamlined, and \u0026lsquo;parallel\u0026rsquo; one. To be able to do this, we need to be able to compare vectors at different points on $\\mathcal M$. Specifically, we would like to differentiate along vector fields by doing something like \u0026lsquo;$X(p) - X(p+\\Delta p)$\u0026rsquo;, but this fails spectacularly because (in general) (i) points on $\\mathcal M$ cannot be added, and (ii) vectors from $T_p\\mathcal M$ and $T_{p^\\prime}\\mathcal M$ cannot be added.\nAnd yet, in the analogy of a boat moving on a manifold on which water flows (as described by the vector field), the boat seems to turn left and right depending on the flow of the water. You might then intuit that we can distinguish a swirly vector field from a streamlined one by sitting on the boat and measuring the inertial forces that are felt.\nThis is because in our mental picture we are implicitly endowing the manifold with an additional piece of structure called a connection (or equivalently, a covariant derivative ), called as such because it \u0026lsquo;connects\u0026rsquo; the adjacent vector spaces of the manifold in a meaningful way. 
The idea of introducing a connection is to (amongst other things) differentiate one vector field along another. It also allows us to describe (at least locally) which path on the manifold represents the shortest distance between two given points. Such paths are called geodesics; airplanes fly on the geodesics of Earth to optimize their fuel consumption, light travels along geodesics (subject to the influence of the curvature of spacetime) leading to the apparent bending of the light that gets too close to a heavy celestial object. I recommend this excellent Dover book by Lovelock and Rund, as well as the YouTube videos by eigenchris . I\u0026rsquo;m also a fan of Timothy Nguyen\u0026rsquo;s videos.\nExterior Calculus The idea of exterior calculus is to introduce integration on manifolds. While we got a glimpse of how this can be done in the case of differential one-forms, particular emphasis is placed on creating higher dimensional objects called differential $k$-forms . Differential $k$-forms are built out of differential one-forms using a variety of operations that includes the wedge product, the exterior derivative, and the hodge-star operator.\nThe book by Lovelock and Rund develops a coordinate-heavy theory of differential $k$-forms via tensors. I read the first 4 or 5 chapters of it before realizing that I was getting more out of the much shorter, coordinate-free treatment that was given in the appendix. Also of note is this video series by Keenan Crane (a professor of computer science at CMU), which is one of my favorite pieces of mathematical exposition that I have seen in a while, covering everything you\u0026rsquo;d need to go from having a superficial understanding of vector spaces to doing exterior calculus on computers. It maintains a coordinate-free viewpoint in the sense that it does not reduce the discussion to the algebra of tensors. 
I highly recommend giving Terence Tao\u0026rsquo;s notes on differential forms a read as well; they\u0026rsquo;re not an exhaustive treatment, but Tao\u0026rsquo;s insights are priceless. I would be remiss not to mention the generalized Stokes\u0026rsquo; theorem, which is arguably the crown jewel of exterior calculus. This video by Aleph 0 provides a great summary of it.\nAs the story goes, Einstein once saw a painter fall off the roof of a building. By considering the inertial forces (or the lack thereof) perceived in the painter\u0026rsquo;s frame of reference during the fall, he inched closer to his invention of the general theory of relativity. He later described this observation as the \u0026lsquo;happiest thought of his life\u0026rsquo;.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nSee the answer by Kevin Lin here.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nIf you\u0026rsquo;re familiar with basic category theory, connect this to the concept of an opposite category.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nActually, the spacetime described by Einstein\u0026rsquo;s theory of special relativity is not curved, but flat, and is called Minkowski spacetime. The curvature of spacetime is introduced in the theory of general relativity, which incorporates the effects of gravity.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nTerence Tao\u0026rsquo;s notes say that paths and curves differ in a pathological sense (pun intended), but we can use them interchangeably without losing out on much.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThe hairy ball theorem would be a fun tangent (pun intended) to go on at this point.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/vector-fields/","summary":"Over the past year, I have struggled to pin down what the scope of my blog should be. Maybe it is for me to catalog the process of self-learning mathematics as an engineering major who lacks a curricular background in modern mathematics. 
This post serves in part the purpose of organizing my own thoughts on these matters, and in part the purpose of providing a roadmap for others who are interested in embarking on a similar journey.","title":"Vector Fields on Manifolds"},{"content":"The Fourier transform takes an (absolutely integrable) function $f:\\mathbb R \\rightarrow \\mathbb R$ and outputs a different (possibly complex-valued) function. If the first is interpreted as a signal (e.g., the waveform of an audio recording that is parameterized by time), then its Fourier transform has its \u0026lsquo;peaks\u0026rsquo; at the dominant frequencies of the signal. I will not expound too much on the Fourier transform itself, but its computation looks something like this1:\n$$ \\mathcal F [f](\\omega) \\coloneqq \\int_{-\\infty}^\\infty f(t) e^{-i \\omega t} dt $$which we will also write as $\\hat f(\\omega) \\coloneqq \\mathcal F [f](\\omega)$. Here\u0026rsquo;s what the Fourier transform of a rectangular \u0026lsquo;pulse\u0026rsquo; signal looks like; the one on the right is called the $\\textrm{sinc}$ function:\nOn the other hand, we also have something called the Fourier series, which takes a periodic function $f$ and gives a summation rather than an integral. Let me not present the formula for the Fourier series right away. Instead, I want to derive it using the Fourier transform for anyone who is interested in connecting the two concepts with each other.2 We will also avoid any of the pathologies related to distributions (which is why Wikipedia hesitates to say much on the matter).\nPeriodic Functions Given a periodic function $\\tilde f(t)$ having a period $T\u003e0$, i.e.,\n$\\dots =\\tilde f(t-T) = \\tilde f(t) =\\tilde f(t+T)=\\tilde f(t+2T)=\\dots$, we can always re-parameterize it as $f(t)\\coloneqq \\tilde f(\\frac{T}{2 \\pi} t)$, so that $f$ has a period of $2\\pi$. 
Hence, we typically bring functions to this \u0026lsquo;standard form\u0026rsquo; before taking the Fourier transform, causing the period $T$ to disappear from the Fourier transform formulas and properties.\nAnother way of viewing a $2\\pi$-periodic function is as a mapping $f:\\mathbb R/2\\pi \\mathbb Z\\rightarrow \\mathbb R$. The object $\\mathbb R/2\\pi \\mathbb Z$ is called a quotient group; it is the object you get when you take $\\mathbb R$ and identify any two points that differ by a multiple of $2\\pi$, which \u0026lsquo;glues\u0026rsquo; $0$ to $2\\pi$ such that it forms a circle. It\u0026rsquo;s easy to see why a function \u0026lsquo;defined on a circle\u0026rsquo; can be interpreted as a periodic function. The theory of Pontryagin duality extends the Fourier transform to the domain $\\mathbb R/2\\pi \\mathbb Z$ (which gives us the Fourier series), as well as to other domains of interest like the integers and their quotient groups (which give us the discrete-time Fourier transforms of aperiodic and periodic signals, respectively).\nNevertheless, in what follows, we will treat $f$ as a periodic function whose domain is $\\mathbb R$.\nFourier Series Given a set of numbers $\\lbrace c_k\\rbrace _{k\\in\\mathbb Z}$, $c_k\\in \\mathbb C$, the following is a $2\\pi$-periodic function (provided that the series converges):\n$$f(t) \\coloneqq \\sum_{k=-\\infty}^{\\infty}c_k \\hspace{2pt}e^{i kt} $$since $e^{i kt} = e^{i k(t+2\\pi)}$ whenever $k\\in \\mathbb Z$. In fact, any $2\\pi$-periodic function that is square-integrable over one period (and by extension, any such $T$-periodic function) can be expressed in the form shown above, which is called its Fourier series representation. Take a second to observe how remarkable this is; at first, it appears as if in order to specify an arbitrary real-valued periodic function $f(\\cdot)$, we would need to specify uncountably many real numbers, $\\lbrace f(t) \\rbrace_{t\\in\\mathbb R}$. 
But the Fourier series representation says that only countably many real numbers are needed to specify $f(\\cdot)$, namely3\n$$\\Big\\lbrace \\textrm{Re}(c_k), \\textrm{Im}(c_k) \\Big\\rbrace_{k\\in\\mathbb Z}.$$If you\u0026rsquo;re anything like me, you might think that the Fourier series representation should arise as a special case of the Fourier transform. After all, periodic functions are just ordinary functions that satisfy an additional (rather special) property. Well\u0026hellip; almost! For the Fourier transform to exist, the function in question should be absolutely integrable (i.e., $|f(t)|$ should integrate to a finite value over $\\mathbb R$), but this is clearly not the case when $f(t)$ is periodic and takes non-zero values.\nFourier Transform $\\rightarrow$ Fourier Series Let $f(t)$ be a $2\\pi$-periodic function. We want to make sense of what its Fourier transform (if it were well-defined) would even look like. Ultimately, we hope that it will coincide with its Fourier series representation in some sense. It would be absolutely horrifying if they turned out to be two completely unrelated concepts. Thankfully, we will see that\n$$``\\mathcal F^{-1}\\Big[\\mathcal F\\big[ f\\big]\\Big] (t)\"$$does end up looking like a Fourier series representation of $f$ (where we used the quotes to remind ourselves that the usual Fourier and inverse Fourier transforms aren\u0026rsquo;t defined when $f(t)$ has a finite period). In fact, it will turn out that $\\mathcal F[f](\\omega)$ has the following form4:\nThe plot on the right looks somewhat like a Dirac comb, which is an evenly spaced \u0026lsquo;comb\u0026rsquo; or a \u0026lsquo;train\u0026rsquo; of Dirac delta functions, whose spacing corresponds to the \u0026lsquo;natural frequency\u0026rsquo; of the signal.\nApproach #1: There are multiple ways of doing this. 
A very mechanical way of getting from the Fourier transform to the Fourier series is by approximating the original function with one that is absolutely integrable, as follows:\n\\[ \\begin{align*} \\hat f_0(\\omega) \u0026= \\lim_{\\epsilon \\rightarrow 0} \\mathcal F \\big[ f(t) e^{-\\epsilon t^2} \\big] \\\\ \u0026= \\lim_{\\epsilon \\rightarrow 0} \\int_{-\\infty}^\\infty f(t) e^{-\\epsilon t^2} e^{-i\\omega t} dt \\end{align*} \\] where the $e^{-\\epsilon t^2}$ term tapers off the plot of $f(t)$ in either direction ($t\\rightarrow \\infty$ and $t\\rightarrow -\\infty$), making the function absolutely integrable. As $\\epsilon \\rightarrow 0$, the factor $e^{-\\epsilon t^2}$ disappears, and we should recover the Fourier series representation of $f(t)$ by expressing it as $\\mathcal F^{-1}\\big[ \\hat f_0 \\big] (t)$. In fact, as $e^{-\\epsilon t^2}$ disappears, the function becomes truly periodic, and its Fourier transform tends towards non-existence. This coincides with the appearance of the Dirac delta functions in the Fourier domain, which are technically not functions, but distributions .\nApproach #2: We can also use the Fourier transform tables and their properties to do this. These tables and properties (which are easily found through a quick Google search) ignore the intricacies related to distributions (in particular, the Dirac delta function) and that\u0026rsquo;s what we\u0026rsquo;ll do too. So in the true spirit of engineering, let\u0026rsquo;s assume that $\\hat f(\\omega)\\coloneqq\\mathcal F\\big[f\\big](\\omega)$ is well defined. We have,\n$$ \\hat f(\\omega) = \\int_{-\\infty}^\\infty f(\\tau) e^{-i\\omega \\tau} d\\tau, $$where we use the notation \u0026lsquo;$\\tau$\u0026rsquo; to prevent its confusion with the variable $t$ used below. 
Due to the periodicity of $f(t)$, we can decompose this as\n$$ \\hat f(\\omega) = \\sum_{k=-\\infty}^{\\infty}\\int_{0}^{2\\pi} f(\\tau) e^{-i\\omega (2\\pi k + \\tau)} d\\tau, $$Good, now we have the \u0026lsquo;summation\u0026rsquo; we need for the Fourier series. Since $f(t) = \\mathcal F^{-1}\\big[\\hat f \\hspace{1pt}\\big] (t) $, we can write\n$$ f(t) = \\frac{1}{2\\pi} \\int_{-\\infty}^\\infty \\left[\\sum_{k=-\\infty}^{\\infty}\\int_{0}^{2\\pi} f(\\tau) e^{-i\\omega (2\\pi k + \\tau)} d\\tau \\right] e^{i\\omega t}d\\omega $$where the factor $1/2\\pi$ is a standard feature of Fourier analysis; it shows up when we do $\\mathcal F^{-1}\\big[\\mathcal F[\\ \\cdot\\ ]\\big]$. Since we\u0026rsquo;re both engineers, we will swap the integrals and summation around (in actuality, we need to check certain convergence and \u0026lsquo;regularity\u0026rsquo; conditions before doing this):\n\\[ \\begin{align*} f(t) \u0026= \\sum_{k=-\\infty}^{\\infty}\\int_{0}^{2\\pi} f(\\tau) \\left[ \\frac{1}{2\\pi} \\int_{-\\infty}^\\infty e^{-i\\omega (2\\pi k + \\tau)} e^{i\\omega t} d\\omega \\right] d\\tau\\\\ \\end{align*} \\] Focusing on the term inside the \u0026lsquo;$[\\ \\cdot\\ ]$\u0026rsquo;, and letting $a \\coloneqq 2\\pi k + \\tau$, we have\n$$ \\frac{1}{2\\pi} \\int_{-\\infty}^\\infty e^{-i\\omega a} e^{i\\omega t} d\\omega $$Notice that this is the inverse Fourier transform of some function whose Fourier transform is $e^{-i\\omega a}$. 
Here\u0026rsquo;s where we use the Fourier transform table as well as the \u0026ldquo;time-shift\u0026rdquo; property of Fourier transform pairs to see that\n$$ \\frac{1}{2\\pi}\\int_{-\\infty}^\\infty e^{-i\\omega a} e^{i\\omega t} d\\omega = \\delta (t - a) = \\delta(t - \\tau - 2 \\pi k) $$Plugging this back in,\n\\[ \\begin{align*} f(t) \u0026= \\sum_{k=-\\infty}^{\\infty}\\int_{0}^{2\\pi} f(\\tau) \\delta(t - \\tau - 2 \\pi k) d\\tau\\\\ \u0026= \\sum_{k=-\\infty}^{\\infty}\\int_{0}^{2\\pi} f(\\tau) \\delta\\big(\\tau - (t - 2 \\pi k) \\big) d\\tau \\\\ \u0026= \\int_{0}^{2\\pi} f(\\tau) \\sum_{k=-\\infty}^{\\infty}\\Big[\\delta\\big(\\tau - (t - 2 \\pi k) \\big)\\Big] d\\tau \\end{align*} \\] For any value of $t$ chosen on the left hand side, there exists a $k_0\\in \\mathbb Z$ such that\n\\[ \\begin{align*} f(t) \u0026= \\int_{0}^{2\\pi} f(\\tau) \\delta\\big(\\tau - (t - 2 \\pi k_0) \\big) d\\tau\\\\ \u0026= f(t - 2 \\pi k_0) = f(t) \\end{align*} \\] This is because $\\sum_{k=-\\infty}^{\\infty}\\delta\\big(\\tau - (t - 2 \\pi k) \\big)$ is a \u0026lsquo;comb\u0026rsquo; of Dirac delta functions that are spaced $2\\pi$ apart from each other, so that only one of these Dirac deltas is picked out by the integral $\\int_0^{2\\pi} \\star\\ d\\tau$.\nBut wait a second! We went around in a circle (pun intended) to show that $f(t) = \\mathcal F^{-1}\\Big[\\mathcal F[f] \\Big](t)$. 
Well, at least now we know that the calculation that we did above checks out, even though we are scarcely able to define $\\mathcal F[f] (\\omega)$ and despite the fact that we swapped summations and integrals around like it\u0026rsquo;s nothing!\nApproach #2 (Revisited): Let\u0026rsquo;s go back to the equation\n\\[ \\begin{align*} f(t) \u0026= \\int_{0}^{2\\pi} f(\\tau) \\sum_{k=-\\infty}^{\\infty}\\Big[\\delta\\big(\\tau - (t - 2 \\pi k) \\big)\\Big] d\\tau \\end{align*} \\] and use the Poisson summation formula (for now, we think of it as a magic trick), which allows us to say that\n$$ 2 \\pi \\sum_{k=-\\infty}^\\infty \\delta (\\tilde \\tau + 2 \\pi k) = \\sum_{k=-\\infty}^\\infty e^{-ik\\tilde \\tau} $$We will unpack this strange result shortly; for now, let\u0026rsquo;s apply it to our expression above (with $\\tilde \\tau=\\tau - t$)\n\\[ \\begin{align*} f(t) \u0026= \\frac{1}{2\\pi} \\int_{0}^{2\\pi} f(\\tau) \\sum_{k=-\\infty}^{\\infty}e^{-ik(\\tau - t)} d\\tau\\\\ \u0026= \\sum_{k=-\\infty}^{\\infty} \\left[\\frac{1}{2\\pi} \\int_{0}^{2\\pi} f(\\tau) e^{-ik\\tau} d\\tau \\right] e^{ikt} \\end{align*} \\] which is just the Fourier series representation of $f(t)$, with the coefficients $c_k$ equated to the integral inside $\\left[\\ \\cdot\\ \\right]$. Thus, the bridge between the Fourier transform and the Fourier series hinges on the Poisson summation formula.\nThe Poisson Summation Formula So why did we need to rely on this strange result, that\n$$ 2 \\pi \\sum_{k=-\\infty}^\\infty \\delta (\\tilde \\tau + 2 \\pi k) = \\sum_{k=-\\infty}^\\infty e^{-ik\\tilde \\tau}\\ \\qquad (*) $$to show that the Fourier transform resolves to the Fourier series in the case of a periodic function? Observe that the function $\\sum_{k=-\\infty}^\\infty\\delta (\\tilde \\tau + 2 \\pi k)$ is a Dirac comb , which is a series of \u0026rsquo;needles\u0026rsquo; (Dirac delta functions) spaced apart by a distance of $2\\pi$. 
Its Fourier transform is also a Dirac comb (see this blog post ; another way to see this is to consider the Dirac comb as a sampled/\u0026lsquo;discrete-time\u0026rsquo; periodic signal whose discrete-time Fourier transform is being sought). Equating $\\sum_{k=-\\infty}^\\infty\\delta (\\tilde \\tau + 2 \\pi k)$ to its corresponding inverse Fourier transform gives us $(*)$. The inverse Fourier transform integrates over the Dirac comb, and the comb turns the integral into a summation.\nEquation $(*)$ is a special case of the Poisson summation formula (which is in turn a special case of the convolution theorem ) applied to the Dirac comb. Proofs of the summation formula can be found on its Wikipedia page , though I haven\u0026rsquo;t looked at it hard enough to see if there\u0026rsquo;s an intuitive explanation of it that warrants repeating here. We could always prove $(*)$ using Approach #1 outlined above. Ultimately, we have only swept the problem under the rug. We still don\u0026rsquo;t understand how to re-discover the Fourier series using the Fourier transform, except by working through $(*)$ first. At least this appears to be a less formidable challenge, one that I might have more to say about in the future. I suspect that a complete theory of the relationships between the various types of Fourier analyses is given by Pontryagin duality , although a more immediate and accessible explanation of it would be nice.\nUnfortunately, there are at least six different conventions for how the Fourier transform and its inverse are defined. We place the factor of $1/2\\pi$ on the inverse Fourier transform, and we assume the function to be $2\\pi$-periodic. The same convention is followed here .\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nBy the way, it is also possible to derive the Fourier transform from the Fourier series using a Riemann sum , by letting the \u0026lsquo;period\u0026rsquo; of a \u0026lsquo;periodic function\u0026rsquo; tend to $+\\infty$. 
This is probably more intuitive to work through, since we\u0026rsquo;ve already seen this sort of a calculation in introductory calculus classes.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nHere's a more general fact, of which this is a special case.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThis is not an actual Fourier transform pair, just an illustration.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/fourier/","summary":"The Fourier transform takes a function and outputs a different (possibly complex-valued) function. We also have something called the Fourier series, which takes a periodic function and gives a summation rather than an integral. I want to derive the Fourier series using the Fourier transform for anyone who is interested in connecting the two concepts with each other.","title":"Fourier Transforms of Periodic Functions"},{"content":"In this post, I want to bridge the gap between abstract vector spaces (which are the mathematical foundation of linear algebra) and matrix multiplication (which is the linear algebra most of us are familiar with). To do this, we will restrict ourselves to a specific example of a vector space \u0026ndash; the Euclidean space. Unlike the typical 101 course in linear algebra, I will avoid talking about solving systems of equations in this post. While solving systems of equations served as the historical precedent1 for mathematicians to begin work on linear algebra, it is today an application, and not the foundation of linear algebra.\nFor this post, I expect that the reader has come across concepts like linear independence and orthogonal vectors before, and can consult Wikipedia for anything that looks new to them.\nThe Recipe for $\\mathbb R^n$ We write $\\mathbb R^n$ as a short-hand for $\\mathbb R \\times \\mathbb R \\times \\dots \\times \\mathbb R$, the set of sequences (of length $n$) of real numbers. 
For notational convenience, we also use \u0026lsquo;$\\mathbb R^n$\u0026rsquo; to denote the $n$-dimensional Euclidean space, which is not just a set of objects, but a set of objects that has a particular structure. In order to arrive at this structure, we need to introduce the following mathematical ingredients, in order:\nScalars: Defined as the elements of a set (technically, a field ) which has two binary operations called addition and multiplication. We choose $\\mathbb R$ (the real numbers) as the set of scalars.\nVectors: For some integer $n \u003e 0$, we define the set of vectors as $\\mathbb R^n$. The vectors have the vector addition and scalar multiplication operations. These operations satisfy certain axioms which ensure that the addition and multiplication operations behave like they ought to.\nBasis: We need to pick a basis $\\mathcal B$ for $\\mathbb R^n$, which is a set of vectors $\\lbrace \\mathbf b_1, \\mathbf b_2, \\dots, \\mathbf b_n \\rbrace$, where $\\mathbf b_i \\in \\mathbb R^n$, such that every vector $\\mathbf v\\in \\mathbb R^n$ can be uniquely expressed as a linear combination of the basis vectors. This means that there is a unique sequence of real numbers $v^{(\\mathcal B)}_1,v^{(\\mathcal B)}_2, \\dots, v^{(\\mathcal B)}_n \\in \\mathbb R$ satisfying\n\\[ \\mathbf v= v^{(\\mathcal B)}_1 \\mathbf b_1 + v^{(\\mathcal B)}_2 \\mathbf b_2 + \\dots + v^{(\\mathcal B)}_n \\mathbf b_n \\] Inner Product: For vectors $\\mathbf v$ and $\\mathbf w$, $\\langle \\mathbf v,\\mathbf w \\rangle$ is called the inner product of $\\mathbf v$ and $\\mathbf w$; it maps each pair of vectors to a scalar. The usual inner product that we define for $\\mathbb R^n$ is sometimes called the dot product. 
An inner product imparts geometry to its vector space, because we can use it to define the \u0026lsquo;length\u0026rsquo; of a vector $\\mathbf v$ as $\\sqrt{\\langle \\mathbf v, \\mathbf v\\rangle }$, and \u0026lsquo;angles\u0026rsquo; between vectors as \\[\\theta(\\mathbf v,\\mathbf w) = \\arccos\\left(\\frac{\\langle \\mathbf v,\\mathbf w \\rangle}{\\sqrt{\\langle \\mathbf v, \\mathbf v\\rangle \\langle \\mathbf w, \\mathbf w\\rangle}}\\right)\\] Orthonormal Basis: If the basis $\\mathcal B$ is such that $\\langle \\mathbf b_i, \\mathbf b_j\\rangle = 1$ when $i=j$ and $0$ otherwise, we call it an orthonormal basis. Because of how we defined $\\theta$, $\\langle \\mathbf b_i, \\mathbf b_j\\rangle = 0$ implies that $\\theta(\\mathbf b_i, \\mathbf b_j)=90^\\circ$.\nSome notes on the basis: Every possible basis of a given (finite-dimensional) vector space has the same number of vectors in it; this number is called the dimension of the vector space. If there were fewer than $n$ vectors in a basis, we would not be able to describe every vector of $\\mathbb R^n$ as a linear combination of the basis vectors.\nThe set of basis vectors is always linearly independent; this comes from the requirement that each vector of $\\mathbb R^n$ can be expressed as a unique linear combination.2\nWe can construct a basis by picking linearly independent vectors one by one, until we are no longer able to do so.\nWe have introduced ingredients 3, 4, and 5 in a very specific order. Let\u0026rsquo;s see why that is so.\nThe Standard Basis Mathematicians avoid picking the basis $\\mathcal B$ explicitly. Often, they start their analysis with the following (implied) disclaimer:\n\u0026ldquo;We have chosen some basis, $\\mathcal B \\subseteq \\mathbb R^n$, but the specific choice of basis does not matter for what we\u0026rsquo;re about to show.\u0026rdquo;\nBasically, don\u0026rsquo;t worry too much about which basis we chose, just know that we have chosen one. 
Once a basis $\\mathcal B = \\lbrace \\mathbf b_1, \\mathbf b_2, \\dots, \\mathbf b_n\\rbrace$ has been chosen, each vector $\\mathbf v\\in \\mathbb R ^n$ can be uniquely expressed by a sequence of $n$ coefficients, $\\left(v^{(\\mathcal B)}_i\\right)_{i=1}^n$, such that $\\mathbf v=\\sum_{i=1}^n v^{(\\mathcal B)}_i \\mathbf b_i$. Thus, the vector $\\mathbf v$ can be expressed unambiguously using the following, more familiar notation:\n\\[\\begin{bmatrix} v^{(\\mathcal B)}_1\\\\ v^{(\\mathcal B)}_2\\\\ \\vdots\\\\ v^{(\\mathcal B)}_n \\end{bmatrix}\\] Note that this notation involves both a vector $\\mathbf v$ and a basis $\\mathcal B$. Choosing a different basis $\\mathcal B' = \\lbrace \\mathbf b'_1, \\mathbf b'_2, \\dots, \\mathbf b'_n \\rbrace$ changes the coefficients of the vector to $\\left(v^{(\\mathcal B')}_i\\right)_{i=1}^n$, but it does not change the vector itself. For bases $\\mathcal B$ and $\\mathcal B'$, we have\n\\[\\mathbf v=\\sum_{i=1}^n v^{(\\mathcal B)}_i \\mathbf b_i =\\sum_{i=1}^n v^{(\\mathcal B')}_i \\mathbf b'_i\\] At a glance, this assertion might appear to contradict the following observation:\n\\[\\begin{bmatrix} v^{(\\mathcal B)}_1\\\\ v^{(\\mathcal B)}_2\\\\ \\vdots\\\\ v^{(\\mathcal B)}_n \\end{bmatrix} \\neq \\begin{bmatrix} v^{(\\mathcal B')}_1\\\\ v^{(\\mathcal B')}_2\\\\ \\vdots\\\\ v^{(\\mathcal B')}_n \\end{bmatrix} \\] This is purely because of the \u0026lsquo;square-bracket\u0026rsquo; notation. Before we write vectors in their \u0026lsquo;square-bracket\u0026rsquo; form, we must not only choose a basis, but also fix a basis. Let\u0026rsquo;s fix a basis $\\mathcal B$ for $\\mathbb R^n$, which we call the standard basis. Now, for $c_1,c_2,\\dots,c_n\\in\\mathbb R$, the \u0026lsquo;square-bracket\u0026rsquo; notation\n\\[\\begin{bmatrix} c_1\\\\ c_2\\\\ \\vdots\\\\ c_n \\end{bmatrix} \\] refers unambiguously to the vector $\\sum_{i=1}^n c_i \\mathbf b_i$. 
Therefore, observe that\n\\[ \\begin{bmatrix} v^{(\\mathcal B)}_1\\\\ v^{(\\mathcal B)}_2\\\\ \\vdots\\\\ v^{(\\mathcal B)}_n \\end{bmatrix} \\neq \\begin{bmatrix} v^{(\\mathcal B')}_1\\\\ v^{(\\mathcal B')}_2\\\\ \\vdots\\\\ v^{(\\mathcal B')}_n \\end{bmatrix} \\text{ \\ because\\ } \\sum_{i=1}^n v^{(\\mathcal B)}_i \\mathbf b_i \\neq \\sum_{i=1}^n v^{(\\mathcal B')}_i \\mathbf b_i \\] Thus, there is a distinction between the vector itself and its representation in the standard basis $\\mathcal B$; the \u0026lsquo;square-bracket\u0026rsquo; notation gives us the latter, and it is our job to infer the former. Observe that the standard basis vectors $\\mathbf b_i$ can themselves be represented in the \u0026lsquo;square-bracket\u0026rsquo; notation, as\n\\[ \\mathcal B = \\left\\lbrace \\begin{bmatrix} 1\\\\ 0\\\\ 0\\\\ \\vdots\\\\ 0 \\end{bmatrix}, \\begin{bmatrix} 0\\\\ 1\\\\ 0\\\\ \\vdots\\\\ 0 \\end{bmatrix}, \\begin{bmatrix} 0\\\\ 0\\\\ 1\\\\ \\vdots\\\\ 0 \\end{bmatrix}, \\dots, \\begin{bmatrix} 0\\\\ 0\\\\ 0\\\\ \\vdots\\\\ 1 \\end{bmatrix} \\right\\rbrace \\] Notice that we can do our usual linear algebra stuff without actually specifying the contents of $\\mathcal B$, as long as we fix $\\mathcal B$ and don\u0026rsquo;t change it thereafter. Nothing about the orthogonality of $\\mathbf b_1, \\mathbf b_2, \\dots, \\mathbf b_n$ has been said yet, because we need an inner product to even define what orthogonality means.\nThe Dot Product We can now define an inner product in terms of the standard basis $\\mathcal B$. For vectors $\\mathbf v, \\mathbf w \\in \\mathbb R^n$, we define $\\langle \\mathbf v, \\mathbf w\\rangle = \\sum_{i=1}^n v^{(\\mathcal B)}_i w^{(\\mathcal B)}_i$, which we call the dot product. 
In the matrix multiplication or \u0026ldquo;square-bracket\u0026rdquo; notation, we write this as\n\\[ \\begin{bmatrix} v^{(\\mathcal B)}_1 \u0026 v^{(\\mathcal B)}_2 \u0026 v^{(\\mathcal B)}_3 \u0026 \\dots \u0026 v^{(\\mathcal B)}_n \\end{bmatrix} \\begin{bmatrix} w^{(\\mathcal B)}_1 \\\\ w^{(\\mathcal B)}_2 \\\\ w^{(\\mathcal B)}_3 \\\\ \\vdots \\\\ w^{(\\mathcal B)}_n \\end{bmatrix} \\] Note that we are defining the inner product this way. Importantly, we are defining it in a way that makes the basis vectors, $\\mathbf b_1, \\mathbf b_2, \\dots, \\mathbf b_n$, orthonormal. If we had instead defined the inner product as $\\langle \\mathbf v, \\mathbf w\\rangle = \\sum_{i=1}^n v^{(\\mathcal B')}_i w^{(\\mathcal B')}_i$, then the basis $\\mathcal B'$ becomes orthonormal (under this new definition of orthonormality). Thus, any basis can be \u0026lsquo;made orthonormal\u0026rsquo; by redefining the inner product appropriately.\nThe \u0026lsquo;row vector\u0026rsquo; corresponding to $\\mathbf v$ is usually called the transpose of $\\mathbf v$, and is denoted as $\\mathbf v^\\intercal$. Strictly speaking, it is a linear map $\\mathbf v^\\intercal:\\mathbb R^n \\rightarrow \\mathbb R$ (See dual space if you\u0026rsquo;re curious about what\u0026rsquo;s going on here.)\nLinear Algebra Let $V$ and $W$ be vector spaces. They could be Euclidean spaces, but they could also be subspaces of Euclidean spaces (recall that a flat plane passing through the origin is a subspace of $\\mathbb R^3$), or something else entirely. A linear map or a linear transformation is a map $f:V\\rightarrow W$ which transforms each vector in $V$ to a vector in $W$ in a linear manner. This means that for $\\mathbf u,\\mathbf v\\in V$ and $a\\in \\mathbb R$,\n\\[f(\\mathbf u + \\mathbf v)= f(\\mathbf u)+f(\\mathbf v)\\] and\n\\[f(a\\mathbf u)= af(\\mathbf u)\\] Notably, we have $f(0 \\mathbf u) = f(\\mathbf 0) = \\mathbf 0$. 
The word \u0026rsquo;linear\u0026rsquo; comes from the special case of the linear map, $f:\\mathbb R \\rightarrow \\mathbb R$; the plot of this function is a straight line passing through the origin. This is also where the \u0026rsquo;linear\u0026rsquo; in linear algebra comes from: it is the study of linear maps in vector spaces.\nNow here\u0026rsquo;s where abstract linear algebra starts developing into the \u0026lsquo;matrix multiplication\u0026rsquo; version of linear algebra:\nAny linear map $f:V \\rightarrow W$ between two finite-dimensional vector spaces $V$ and $W$ can be represented as a matrix.\nTo see this, let\u0026rsquo;s start by choosing bases for $V$ and $W$, denoted as $\\mathcal B^{(V)} = \\lbrace \\mathbf b^{(V)}_1, \\mathbf b^{(V)}_2, \\dots, \\mathbf b^{(V)}_n \\rbrace$ and $\\mathcal B^{(W)} = \\lbrace \\mathbf b^{(W)}_1, \\mathbf b^{(W)}_2, \\dots, \\mathbf b^{(W)}_m \\rbrace$, where $n$ and $m$ are the dimensions of $V$ and $W$. For simplicity, we will assume that the scalars in $V$ and $W$ are real numbers (as opposed to, say, one of them being a complex vector space).\nObserve that $f(\\mathbf b^{(V)}_i)\\in W$. Each vector in the basis of $V$ is mapped (linearly) to a corresponding vector in $W$. This means that we can express each of the mapped basis vectors $f(\\mathbf b^{(V)}_i)$ as a linear combination:\n\\[ \\begin{align*} f(\\mathbf b^{(V)}_i) \u0026= F_{1i} \\mathbf b^{(W)}_1 +F_{2i} \\mathbf b^{(W)}_2 + \\dots + F_{mi} \\mathbf b^{(W)}_m \\\\ \u0026= \\sum_{j=1}^{m} F_{ji} \\mathbf b^{(W)}_j \\end{align*} \\] where $F_{ji} \\in \\mathbb R$ are unique. Now consider the action of $f$ on an arbitrary vector $\\mathbf v \\in V$ that is not a basis vector. 
We first write $\\mathbf v$ as the linear combination\n\\[ \\mathbf v = v_1 \\mathbf b^{(V)}_1 + v_2 \\mathbf b^{(V)}_2 + \\dots + v_n \\mathbf b^{(V)}_n \\in V \\] Due to the properties of a linear transformation (i.e., its linearity), we have the following algebra:\n\\[ \\begin{align*} f(\\mathbf v) \u0026= f\\left(v_1 \\mathbf b^{(V)}_1 + v_2 \\mathbf b^{(V)}_2 + \\dots + v_n \\mathbf b^{(V)}_n\\right)\\\\ \u0026= f\\big(v_1 \\mathbf b^{(V)}_1\\big) + f\\big(v_2 \\mathbf b^{(V)}_2\\big) + \\dots + f\\big(v_n \\mathbf b^{(V)}_n\\big)\\\\ \u0026= v_1 f\\big(\\mathbf b^{(V)}_1\\big) + v_2 f\\big( \\mathbf b^{(V)}_2\\big) + \\dots + v_n f\\big(\\mathbf b^{(V)}_n\\big)\\\\ \\end{align*} \\] Thus, the action of $f$ on the vector $\\mathbf v$ indirectly depends on the action of $f$ on the basis vectors. We have already seen where $f$ takes the basis vectors of $V$, so let\u0026rsquo;s plug that in:\n\\[ \\begin{align*} f(\\mathbf v) \u0026= \\sum_{i=1}^n v_i f(\\mathbf b^{(V)}_i) \\\\\u0026= \\sum_{i=1}^n v_i \\sum_{j=1}^{m} F_{ji} \\mathbf b^{(W)}_j \\\\\u0026= \\sum_{j=1}^{m} \\sum_{i=1}^n v_i F_{ji} \\mathbf b^{(W)}_j\\\\ \u0026= \\sum_{i=1}^n v_i F_{1i} \\mathbf b^{(W)}_1 + \\sum_{i=1}^n v_i F_{2i} \\mathbf b^{(W)}_2 + \\dots + \\sum_{i=1}^n v_i F_{mi} \\mathbf b^{(W)}_m \\end{align*} \\] where $\\sum_{i=1}^n v_i F_{1i}$ is the coefficient of $f(\\mathbf v)$ corresponding to the basis vector $\\mathbf b_1^{(W)}$. 
From here on, it\u0026rsquo;s only a matter of noticing that we can represent this entire relationship using the \u0026ldquo;matrix-multiplication\u0026rdquo; operation:\n\\[ \\begin{bmatrix} \\sum_{i=1}^n v_i F_{1i}\\\\ \\sum_{i=1}^n v_i F_{2i}\\\\ \\vdots\\\\ \\sum_{i=1}^n v_i F_{mi} \\end{bmatrix} = \\begin{bmatrix} F_{11} \u0026 F_{12} \u0026 \u0026 \\\\ F_{21} \u0026 F_{22} \u0026 \u0026 \\\\ \u0026 \u0026 \\ddots \u0026 \u0026\\\\ \u0026 \u0026 \u0026 F_{mn} \\end{bmatrix} \\begin{bmatrix} v_1\\\\v_2\\\\ \\vdots \\\\ v_n \\end{bmatrix} \\] which we can write as \u0026ldquo;$\\mathbf w = \\mathbf F \\mathbf v$\u0026rdquo;. There is a subtlety here: on the left-hand side of this equation, we assume the \u0026lsquo;standard basis\u0026rsquo; to be $\\mathcal B^{(W)}$, whereas for the vector on the right we were using the standard basis $\\mathcal B^{(V)}$. Thus, we need to fix both bases (one for $V$ and one for $W$) before the linear transformation can be written, unambiguously, as a matrix multiplication. If the dimensions of $V$ and $W$ are the same, we may pick the same basis on either side. Observe that we never used the inner product while talking about linear transformations, and thus, we do not claim whether the bases we used above are orthonormal. They are simply linearly independent, as all bases are. In case the basis $\\mathcal B^{(V)}$ is orthonormal, then this just means that we can find the coefficients $v_1, \\dots, v_n$ very easily: $v_i = \\langle \\mathbf v, \\mathbf b_i \\rangle$.\nOrthonormal Transformations Let\u0026rsquo;s now study $\\mathbb R^n$ as an inner product space, which is the vector space $\\mathbb R^n$ combined with the usual inner product \u0026ndash; the dot product.\nWe say that a matrix $\\mathbf U$ is orthonormal if $\\mathbf U^\\intercal \\mathbf U = \\mathbf U \\mathbf U^\\intercal = \\mathbf I$. 
This is closely related to how we say that a set of basis vectors is orthonormal: if $\\mathcal B$ is an orthonormal basis, then so is the basis $\\mathcal B_U = \\lbrace \\mathbf U \\mathbf b_1, \\mathbf U \\mathbf b_2, \\dots, \\mathbf U \\mathbf b_n \\rbrace$, because\n\\[ \\langle \\mathbf U\\mathbf b_i, \\mathbf U\\mathbf b_j \\rangle = \\mathbf b_i^\\intercal \\mathbf U^\\intercal \\mathbf U \\mathbf b_j = \\mathbf b_i ^\\intercal \\mathbf b_j = \\langle \\mathbf b_i, \\mathbf b_j \\rangle \\] Let the underlying linear transformation corresponding to $\\mathbf U$ be denoted as $g:V\\rightarrow V$, with $\\mathcal B$ being an orthonormal basis for $V$. $\\mathbf U$ is the representation of $g$ in the matrix multiplication form, with respect to the basis $\\mathcal B$. Recall the algebra we did earlier:\n\\[g(\\mathbf v) = g\\Big( \\sum_{i=1}^n v_i\\mathbf b_i\\Big) = \\sum_{i=1}^n v_i g(\\mathbf b_i)\\] where we know that the set $\\lbrace g(\\mathbf b_1), g(\\mathbf b_2), \\dots g(\\mathbf b_n)\\rbrace = \\mathcal B_U$ is orthonormal. Thus, $\\mathbf v$ and $g(\\mathbf v)$ have the same representation (given by the numbers $v_1, v_2, \\dots v_n$) under $\\mathcal B$ and $\\mathcal B_U$, respectively. This is why we can call $\\mathbf U$ a \u0026ldquo;change of basis\u0026rdquo; \u0026ndash; it keeps the vector\u0026rsquo;s representation the same, but changes the (orthonormal) basis that we are representing it in. Even if the vector\u0026rsquo;s representation is the same in either basis, the vector itself changes (in general) under $\\mathbf U$:\n\\[ \\mathbf v = \\sum_{i=1}^{n} v_i \\mathbf b_i \\neq \\sum_{i=1}^{n} v_i g(\\mathbf b_i) = g(\\mathbf v) \\] Alternatively, we can re-express the transformed vector in the original basis $\\mathcal B$, in which case $g$ is interpreted as purely a transformation of the vector\u0026rsquo;s components while keeping the basis fixed. 
This duality in how we can view a \u0026lsquo;change of basis\u0026rsquo; has been explored more in this article.\nThe vectors $\\mathbf v$ and $g(\\mathbf v)$ have the same components if we rotate our head along with the transformation. They have different components if we keep our head fixed. These are two different (i.e., dual) ways of interpreting an orthonormal transformation.\nPreserving Structure and Dimension Any transformation on a mathematical space that preserves its structure (i.e., the relationships of its objects to each other) turns out to be quite special. Linear transformations preserve the structure of a vector space, because any three vectors $\\mathbf u,\\mathbf v,\\mathbf w\\in V$ which have the relationship $\\mathbf u + \\mathbf v = \\mathbf w$ are still related to each other after the transformation: $f(\\mathbf u) + f(\\mathbf v) = f(\\mathbf w)$.3\nStructure-preserving transformations which are also invertible are called isomorphisms. We can show that the inverse $f^{-1}:W\\rightarrow V$, if it exists, must also be a linear transformation. Thus, $f^{-1}$ can be represented as a matrix. Invertible linear transformations are the isomorphisms of vector spaces. Invertible matrices are \u0026ldquo;square\u0026rdquo; because a linear transformation can only be invertible if its domain and codomain have the same dimension.4\nSimilarly, orthonormal matrices represent the structure-preserving transformations in inner-product spaces: a set of vectors that is orthonormal before the transformation remains orthonormal after the transformation, where orthonormality is defined via the dot product. They are also the isomorphisms of inner-product spaces, because the inverse of an orthonormal matrix $\\mathbf U$ always exists \u0026ndash; it is $\\mathbf U^\\intercal$.\nMathematicians almost always (or perhaps, always) study mathematical objects \u0026ldquo;up to isomorphism\u0026rdquo;. 
This means that we are not studying any particular mathematical object, but rather we are simultaneously studying all of the mathematical objects that are isomorphic to each other. This is why we do not need to specify which basis we are using as the standard basis: it simply does not matter, as long as we fix this basis and stay consistent. This is analogous to how we may need to fix the origin when studying \u0026lsquo;displacement\u0026rsquo; and \u0026lsquo;speed\u0026rsquo; in physics. Choosing a different origin does not change the physical phenomenon; it only changes our description of it.\nSee this for the historical context of matrix multiplication, which is different from (but essentially the same as) modern mathematics\u0026rsquo; treatment of it.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThe words every and unique can be compared to the concepts of surjectivity (also called onto) and injectivity (also called one-to-one), respectively. A function between two sets is invertible if and only if it is both surjective and injective. The \u0026lsquo;sets\u0026rsquo; here are the vectors and their representations.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThere is an abuse (or rather, a reuse) of notation here; note that the vector addition in $W$ may be different from the vector addition in $V$, though we denote both as \u0026lsquo;$+$\u0026rsquo; for convenience. We also use \u0026lsquo;$+$\u0026rsquo; to denote the scalar addition operation.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nAn invertible function between sets must be injective and surjective. If the dimension of $W$ is greater than that of $V$, then $f$ cannot be surjective. 
If the dimension of $V$ is greater, then $f$ cannot be injective.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/matrix/","summary":"In this post I want to bridge the gap between abstract vector spaces (which are the mathematical foundation of linear algebra) and matrix multiplication (which is the linear algebra most of us are familiar with). Unlike the typical 101 course in linear algebra, I will avoid talking about solving systems of equations.","title":"Matrix Multiplication"},{"content":"A running gag in engineering colleges is that a lot of instructors begin their first class of the semester with this question: \u0026ldquo;What is a vector?\u0026rdquo;. I used to find this ritual almost pointless because to me, every answer to this question felt either like a non-answer or a matter of context. I mean it depends, right? A structural engineer should have a different answer to this question than, say, a data scientist. To a structural engineer, a vector is a physical measurement that has a magnitude and a direction, whereas a data scientist may not necessarily think of a vector as having a direction. Indeed, in its full generality, a vector does not need to have geometric notions such as directions and angles associated with it. Today, I no longer think that this is a matter of context. By virtue of how we phrase the question \u0026ldquo;What is a vector?\u0026rdquo;, we may be asking (without ambiguity) the question: \u0026ldquo;If I call some mathematical object a vector, what does that tell you about it?\u0026rdquo;\nTo skip ahead to the punchline of this post, a vector is an element of a vector space. If that too feels like a non-answer to you, then I would argue that we were asking the wrong question in the first place. We should be asking instead \u0026ndash; \u0026ldquo;What is a vector space?\u0026rdquo;\nField Before defining a vector space, we need to define a field. 
To jump ahead a bit, $\\mathbb R$ and $\\mathbb C$ (the real and complex numbers, respectively) are examples of fields. Data scientists typically use $\\mathbb R$ as the underlying field for the vector spaces they are dealing with, but control theorists may work in a complex vector space, where the field under consideration may be $\\mathbb C$.\nA field, $K$, is a set of objects, together with two binary operations \u0026ndash; addition and multiplication (which we denote by \u0026lsquo;$+$\u0026rsquo; and \u0026lsquo;$\\times$\u0026rsquo;), that satisfy the so-called field axioms. The field axioms are a set of rules that establish the associativity, commutativity, and distributivity of addition and multiplication. For example, the addition operation is said to be associative if, for elements $a,b,$ and $c$ in $K$, $(a+b)+c=a+(b+c)$.\nIn addition, the field axioms stipulate that there be (distinct) identity elements for \u0026lsquo;$+$\u0026rsquo; and \u0026lsquo;$\\times$\u0026rsquo;. In the field $\\mathbb R$, we have for $a\\in \\mathbb R$, $a+0=a$ and $a\\times 1=a$, which make $0$ and $1$ the identity elements of addition and multiplication, respectively. Addition is invertible; multiplication is invertible with one exception \u0026ndash; multiplication by the additive identity ($0$ in the case of $\\mathbb R$) is not invertible.\nVector Space A vector space, $(V, K, $ +$, *)$, has the following ingredients:\na field, $K$, which comes with the addition and multiplication operations, \u0026lsquo;$+$\u0026rsquo; and \u0026lsquo;$\\times$\u0026rsquo;. 
Elements of $K$ are called scalars\na set of objects, $V$, whose elements are called vectors\ntwo more binary operations, which we denote by \u0026lsquo;+\u0026rsquo; and \u0026lsquo;$*$\u0026rsquo;, where\n\u0026lsquo;+\u0026rsquo; is called the vector addition operator; it operates on two vectors to give another vector\n\u0026lsquo;$*$\u0026rsquo; is called the scalar multiplication operator; it combines a scalar and a vector to give another vector\nWhile $K$ satisfies the field axioms (by definition), the vector space $(V, K,$ +$, {\\ast})$ satisfies additional axioms. In addition to commutativity and associativity of \u0026lsquo;+\u0026rsquo;, it has some interesting axioms which may look trivial unless we distinguish the operations \u0026lsquo;$+$\u0026rsquo; and \u0026lsquo;$\\times$\u0026rsquo; (which belong to the field, $K$) from \u0026lsquo;+\u0026rsquo; and \u0026lsquo;$*$\u0026rsquo;. For instance, one of these axioms is that, for $a, b \\in K$ and $v \\in V$,\n$$(a + b)* v = a*v\\, \\char\"FE62 \\,b*v$$which is not obvious or trivial, because we haven\u0026rsquo;t stipulated anything else so far that requires \u0026lsquo;$+$\u0026rsquo; and \u0026lsquo;+\u0026rsquo; to behave anything like each other.\nAn obvious example of a vector space is $(\\mathbb R^n, \\mathbb R, $ +$, *)$, where $n$ is a positive integer. A less obvious example is the vector space of functions having a common domain (say, $\\mathbb R$), whose codomain is a vector space. In this case, the operations \u0026lsquo;+\u0026rsquo; and \u0026lsquo;${\\ast}$\u0026rsquo; are pointwise addition and multiplication of functions. For instance, for a scalar $\\alpha$ and a vector $f$, the scalar-vector multiplication $\\alpha * f$ yields the vector $g$, where $g(x)=\\alpha \\times f(x)$. We looked at this vector space in an earlier post.\nNote that $\\mathbb R^n$ is a vector space even if $n=1$. 
I\u0026rsquo;ve put together some examples of vector spaces in a table here.\nAdding more ingredients\u0026hellip; We can add more ingredients to a vector space to give it more structure. I explore this in other posts, but in summary,\nA normed vector space is a vector space, $(V, K,$ +$, {\\ast})$, along with an operation, $\\Vert \\cdot\\Vert : V \\rightarrow \\mathbb R$, which is called the norm. The norm should satisfy certain axioms. The norm is useful for defining the notion of a length or magnitude of a vector $v\\in V$ as being equal to $\\Vert v \\Vert$ units.\nAn inner product space is a vector space, $(V, K,$ +$, {\\ast})$, along with an inner product operation, $\\langle \\cdot,\\cdot \\rangle : V\\times V \\rightarrow K$, which is a binary operation that maps each pair of vectors to a scalar in the field. The inner product comes with its own set of axioms. In $\\mathbb R^n$, we can define $\\langle x, y\\rangle = x^\\top y$ as an inner product. Inner products are necessary to define geometric concepts such as directions and angles.\nThus, a vector may have neither a magnitude nor a direction! Someone who claims otherwise is perhaps referring to the inner product space $\\mathbb R ^n$ (where the inner product is the so-called \u0026lsquo;dot product\u0026rsquo;), which is a very specific example of a vector space.\nDoes any of this matter? What we have done is develop an axiomatic characterization of a vector space. All of your favorite properties of vectors are now theorems that follow from these axioms. A key motivation for doing so is that it allows us to define a lot of useful vector spaces where the vectors can be anything from functions to random variables. Elements of these vector spaces behave like vectors ought to, so we can do linear algebra with them; e.g., we can construct bases and linear transformations for them. 
Importantly, we can extend them to more \u0026ldquo;structured\u0026rdquo; spaces by adding ingredients such as norms and inner products.\nEven if we are working within $\\mathbb R^n$, it can help to be aware of these subtleties. Consider the following paradox for vectors $u,v$ and $w$ in $\\mathbb R^n$, where we omit all the binary operations:\nWe have the well-defined column vector, $(u^\\top v)w = w(u^\\top v)$. Once we drop the parentheses, the quantity \u0026lsquo;$u^\\top v w$\u0026rsquo; makes no sense, because it is not clear what \u0026lsquo;$v w$\u0026rsquo; is supposed to be, whereas \u0026lsquo;$w u^\\top v$\u0026rsquo; appears to be a well-defined column vector. If one were doing a proof, it may appear reasonable to substitute $(u^\\top v)w$ with $w(u^\\top v)$ and call it \u0026lsquo;commutativity\u0026rsquo;. It may also appear reasonable to drop the parentheses and call that \u0026lsquo;associativity\u0026rsquo;. However, we seem to have made an error somewhere. What\u0026rsquo;s going on?\nRe-introducing all of the binary operations, we see that what we called \u0026lsquo;commutativity\u0026rsquo; is in fact the following statement (whose validity we are yet to determine):\n\\[ \\langle u, v\\rangle * w = w * \\langle u, v\\rangle \\] Strictly speaking, \u0026lsquo;$*$\u0026rsquo; is a function which is not symmetric in its arguments; it is a function of the form $f(\\text{scalar}, \\text{vector})=\\text{vector}$. We cannot swap the arguments of $f$ around without redefining $f$, as part of which we should redefine the domain of $f$. Nevertheless, we may assume that $\\langle u, v\\rangle * w = w * \\langle u, v\\rangle$ is a meaningful statement for our notational convenience. 
On the other hand, what we called \u0026lsquo;associativity\u0026rsquo; earlier does not have any meaning, because two completely different binary operations are involved.\nAs a rule of thumb, if you retain the binary operations, there is no way that it would lead to a \u0026lsquo;paradox\u0026rsquo;, but once you drop the binary operations in favor of concise (but potentially ambiguous) notation, you\u0026rsquo;re on your own.\nThe Bigger Picture At the end of the day, I do not recommend introducing all of the binary operations into your linear algebra proofs. I certainly do not recommend distinguishing between \u0026lsquo;$+$\u0026rsquo; and \u0026lsquo;+\u0026rsquo; (unless they happen to behave quite differently from each other). What I would propose is that one be aware of the intricacies of how we define a vector space. It is only natural to expect that similar intricacies, which are neglected or glossed over in undergraduate engineering education, are present in how we define any mathematical concept.\nIn category theory, which is an area of mathematics that seeks to unify several disparate areas of mathematics, a most prized result is the Yoneda lemma, which in its essence says the following: A mathematical object can be defined (pretty much) uniquely without referring to any of its intrinsic characteristics, but purely by means of its relationships to other objects. For instance, the number $2$ refers to the concept where you have more than $1$, but also less than $3$ thing(s). $2$ is also less than $6$, and so on. Thus, we may define $2$ by virtue of its relationships to other objects in its category.\nThis is kind of like how we define not what a vector is, but rather, we define a vector space as the collection of objects that are related to each other in a certain way. 
The definition of a vector follows naturally \u0026ndash; it is an object that participates in this system of relationships.\n","permalink":"https://shirazkn.github.io/posts/vector/","summary":"A running gag in engineering colleges is that a lot of instructors begin their first class of the semester with this question \u0026ndash; \u0026ldquo;What is a vector?\u0026rdquo;. I used to find this ritual almost pointless because every answer felt either like a non-answer or a matter of context. Today, I no longer think that this is a matter of context.","title":"What is a Vector?"},{"content":" We talked about why sparsity plays an important role in many of the inverse problems that we encounter in engineering. To actually find the sparse solutions to these problems, we add \u0026lsquo;sparsity-promoting\u0026rsquo; terms to our optimization problems; the machine learning community calls this approach regularization.\nRegularization An optimization method that was popularized in the $80$s and $90$s is the LASSO , also called $L^1$ norm regularization, which solves problems of the following form:\n\\[ \\begin{array}{ll} \\underset{x\\in \\mathbb R^n}{\\textrm{minimize}} \u0026 g(x) + \\lambda \\|x\\|_1 \\end{array} \\] Usually, $g(x)$ corresponds to an error/loss term. It can also be the negative of something we wish to $\\text{maximize}$. The claim is that the additional term $\\lVert x\\rVert_1$ promotes the sparsity of the solution $x^\\star$, i.e., it attempts to set one or more elements of $x^\\star$ to $0$. Similarly, in ridge regression (also called Tikhonov regularization or $L^2$ norm regularization), we add a $\\lVert x\\rVert_2^2$ term to the objective. This is known to shrink the solution towards the origin, but it does not necessarily make the solution sparse.1\nWhat about $\\lVert x\\rVert_2$ (as opposed to $\\lVert x\\rVert_2^2$), what would that do? How do we reason about an arbitrary \u0026lsquo;regularization term\u0026rsquo; and interpret what it does? 
If you have encountered this question before, then you\u0026rsquo;ve likely seen explanations such as this one . 👈🏽 While that\u0026rsquo;s a great, conversational explainer on sparsity, I want to give it a slightly more formal treatment for anyone interested.\nSub-Gradient Descent I expect that the reader is familiar with gradient descent and convex functions. I will offer a brief introduction to sub-gradient descent, which extends gradient descent to the case where the objective function is non-differentiable, but still convex.\nA non-differentiable function is one that does not have a well-defined gradient at one or more points of its domain. But if the function is convex (i.e., bowl-shaped), then it has the next best thing: a sub-gradient of $f(x)$ at $x^\\star$ is a vector $w$, such that the inequality\n\\[ f(x) - f(x^\\star) \\geq w^\\intercal (x-x^\\star)\\] holds for all $x$ in the domain of $f$. The sub-gradient $w$ is not unique in general. However, if $f$ is differentiable at $x^\\star$, then $w$ takes the unique value of $\\nabla f(x^\\star)$. A convex function has at least one sub-gradient at every point of its domain; we can prove that fact using this theorem . Observe that sub-gradients can be thought of as hyperplanes that touch or support the function from below, similar to how the gradient of a differentiable convex function touches it from below.\nSince the sub-gradient is non-unique, we define the sub-differential of $f$ at $x^\\star$, denoted as $\\partial f(x^\\star)$, as the set of all sub-gradients of $f$ at $x^\\star$. We can now do gradient descent, but instead of the gradient, we pick a sub-gradient direction to descend towards. This procedure of sub-gradient descent is motivated by the following fact: $x^\\star$ is the global minimizer of $f(x)$ if and only if $\\mathbf 0 \\in \\partial f(x^\\star)$. 
For differentiable functions, sub-gradient descent reduces to gradient descent.\nSimilar to how, for differentiable functions,\n\\[\\nabla(f + g)(x)=\\nabla f(x) + \\nabla g(x),\\]\nwe have\n\\[\\partial (f+g)(x) = \\partial f (x) + \\partial g(x) \\] However, we are dealing with sets and not vectors in the non-differentiable case. The \u0026lsquo;$+$\u0026rsquo; in the preceding equation refers to the Minkowski sum; for sets $\\mathcal A$ and $\\mathcal B$,\n\\[ \\mathcal A + \\mathcal B = \\left\\lbrace a+b | a\\in \\mathcal A, b \\in \\mathcal B \\right\\rbrace\\] Revisiting the LASSO With this, let\u0026rsquo;s look at the LASSO-type problem,\n\\[\\begin{array}{ll} \\underset{x\\in \\mathbb R^n}{\\textrm{minimize}} \u0026 f(x) + g(x) \\end{array} \\] where the green lines show the sub-gradients of the two functions at $x^\\star$. This function is minimized whenever we can pick sub-gradients from $f$ and $g$ such that they \u0026lsquo;cancel each other out\u0026rsquo;.\n\\[f+g\\text{ is minimized at }x^\\star\\\\ \\Updownarrow \\\\ \\mathbf 0 \\in \\partial (f+g)(x^\\star) \\\\ \\Updownarrow\\\\ \\exists w \\in \\partial f (x^\\star) \\text{ \\ such that \\ } w+ \\nabla g(x^\\star) =\\mathbf 0\\] At the differentiable points ($x\\neq x^\\star$), neither function has much freedom in picking a sub-gradient. But at $x^\\star$, $f(x)$ has a range of sub-gradients to pick from; it can choose one that \u0026lsquo;cancels out\u0026rsquo; the corresponding (sub-)gradient of $g(x)$ at $x^\\star$. This is why a convex, non-differentiable regularization term is likely to pull the solution towards its non-differentiable points!\nChoosing a \u0026lsquo;Regularization Term\u0026rsquo; Suppose $x\\in \\mathbb R^2$. The function $\\lVert x \\rVert_1$ has the following shape:\nwhere the green plane is a sub-gradient at the origin. Since $\\lVert x \\rVert_1$ is non-differentiable along the axes, it tries to snap the minima towards the axes. 
Note that the axes of $\\mathbb R^2$ are exactly where the sparse vectors are. What about when $x\\in \\mathbb R^3$? At what points is $\\lVert x\\rVert_1$ non-differentiable then? (Hint: it\u0026rsquo;s not just the axes!) The function $\\lVert x \\rVert_2$ looks like an ice-cream cone:\nsince it\u0026rsquo;s only non-differentiable at the origin, it tries to snap the solution towards the origin. This is different from ridge regression, which instead uses $\\lVert x\\rVert_2 ^2$. The function $\\lVert x\\rVert_2 ^2$ is differentiable everywhere; it is \u0026lsquo;bowl-shaped\u0026rsquo;. It pulls the solution towards the origin, but does not particularly demand that the solution be exactly $\\mathbf 0$. So is there a use for $\\lVert x \\rVert_2$? Yes! It can be used to promote the block-sparsity of a vector, where the $0$\u0026rsquo;s of the vector appear in blocks. Consider\n\\[ x^\\intercal = \\left[\\ x_1^\\intercal\\ x_2^\\intercal\\ x_3^\\intercal \\dots x_n^\\intercal\\ \\right] \\] where $x_i \\in \\mathbb R^{d_i}$, and $x \\in \\mathbb R^{\\sum_{i=1}^n d_i}$. Suppose we know that the sparsity of $x$ occurs in blocks, i.e., some of the $x_i$ are full of zeros. Then, the regularization term $\\sum_{i=1}^{n}\\lVert x_i \\rVert_2$ is what we want to use since it sets some of the $x_i$ to $\\mathbf 0$ but does not promote sparsity within each block. (I used this fact to solve an engineering problem in my PhD dissertation .)\nClosing Note There are many different ways to think about sparsity. For instance, one could imagine trying to balance a tennis ball that is resting on one of the surfaces we showed above, by holding the surface from below and tilting it. The ball is likely to settle at one of the non-differentiable points of the surface, thereby minimizing its potential energy. I like the sub-gradient interpretation because it works irrespective of the dimension. 
We can test for differentiability of arbitrary functions even if we cannot visualize them.\nAs an aside, LASSO and ridge regression can be studied using the theory of proximal operators .\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/sparsity_2/","summary":"We talked about why sparsity plays an important role in many of the inverse problems that we encounter in engineering. To actually find the sparse solutions to these problems, we add \u0026lsquo;sparsity-promoting\u0026rsquo; terms to our optimization problems; the machine learning community calls this approach regularization.","title":"Understanding Sparsity through Sub-Gradients"},{"content":"The so-called curse of dimensionality in machine learning is the observation that neural networks with many parameters can be impossibly difficult to train due to the vastness of their parameter spaces. Another issue that arises in practice is that most of the neural network does not do anything, as a lot of its weights turn out to be redundant. This is because many (if not all) of the problems we\u0026rsquo;re interested in solving as engineers have some inherent sparsity. Steve Brunton has an excellent video explaining why this is so.\nAs a shorthand, the word \u0026lsquo;sparse\u0026rsquo; means \u0026lsquo;mostly zeros\u0026rsquo;. Here is a sparse vector:\n\\[x=[0\\ 3\\ 5\\ 0\\ 0\\ 1\\ 0\\ \\dots\\ 0\\ 8\\ 0]^{\\intercal}\\] Often, you might need to transform the original object into another domain, before the object looks sparse. As an example, the function $\\sin(t)$ is sparse in the frequency domain (it only has a single frequency component) but is non-sparse in the time domain, because $\\sin(t)\\neq 0$ for most values of $t$. The Fourier transform lets us move back and forth between the original and sparse domains. 
A lot of high-dimensional data transfer (like streaming videos, talking on Zoom) relies on exploiting the sparsity of the information, if not in the frequency domain, then in some other form. (See the post on Hilbert spaces for a more general treatment.)\nCompressive Sensing A field of research that blew up in the $2000$s is compressive sensing, in which a recurring theme is the following observation. Suppose you want to solve the problem $Ax=b$; you know $A$ and $b$, but not $x$. We call this a system of equations. It is a high-school math fact that exactly one of the following is true:\nthere is a unique $x$ such that $Ax=b$; there is no $x$ such that $Ax=b$ (inconsistent system of equations, as is typical of overdetermined systems); there are infinitely many $x$\u0026rsquo;s such that $Ax=b$ (underdetermined system of equations). The last case typically arises when $A\\in \\mathbb R^{m\\times n}$ is a \u0026lsquo;wide\u0026rsquo; matrix, with $n\u003em$. This automatically means that $A$ has a non-trivial nullspace (or kernel)1, and for any $v\\in \\ker(A)$, $A(x+v)=Ax$. So given one solution, we can construct infinitely many solutions this way.\nOne reason for solving $Ax=b$ might be because $b$ are the measurements that we have of an unknown vector $x$; $A$ is called the measurement matrix. If $n\\gg m$, it means that we have far fewer measurements than unknowns (underdetermined system of equations). The theory of compressive sensing says that it is still possible to recover $x$ uniquely if the solution is known to be sparse.2 And as we mentioned, the solution oftentimes is sparse. Instead of solving $Ax=b$, we can solve\n\\[\\begin{array}{ll} \\underset{x\\in\\mathbb R^n}{\\textrm{minimize}} \u0026\\|x\\|_0\\\\ \\textrm{subject to} \u0026 Ax = b \\end{array} \\] which picks out the sparsest solution (in terms of the number of $0$\u0026rsquo;s in $x$). The notation \u0026lsquo;$\\lVert x\\rVert_0$\u0026rsquo; is introduced here . 
In this way, we can uniquely reconstruct $x$ with a comically small number of measurements. (In fact, it can even beat the sampling rate suggested by the Nyquist sampling theorem.) The simple trick of searching for sparse solutions now allows us to do things like MRI much more efficiently.\nWhy are sparse solutions special? So why is it that among the infinitely many solutions of $Ax=b$, the sparsest solution turns out to be precisely the solution we were looking for?\nSuppose $A\\in \\mathbb R^{m \\times n}$, $m\\leq n$, and $\\textrm{Rank}(A)$ is its rank. We know that $r = n- \\textrm{Rank}(A)$ is the dimension of its nullspace (see the rank-nullity theorem). Then, the space of the solutions of $Ax=b$ is $r$-dimensional (assuming that at least one solution exists). Moreover, $r\\geq n-m$ since $\\textrm{Rank}(A)\\leq m$. So if we have too few measurements (i.e., a small value of $m$) then the space of solutions is rather large.\nNow suppose we know that the true solution $x$ is $s$-sparse, i.e., it has at most $s$ non-zero elements. There are $\\binom{n}{s}$ ways of choosing where these non-zero elements may appear. Each choice of the location of the non-zero elements (called the support of $x$) defines an $s$-dimensional subspace. The space of $s$-sparse vectors is the union of these $s$-dimensional spaces. For example, let $n=3$ and $s=2$, then the $2$-sparse vectors in $\\mathbb R^3$ are\n\\[\\textrm{span}\\big(\\lbrace [1\\ 0\\ 0]^{\\intercal}, [0\\ 1\\ 0]^{\\intercal}\\rbrace\\big)\\ \\cup\\ \\textrm{span}\\big(\\lbrace [1\\ 0\\ 0]^{\\intercal}, [0\\ 0\\ 1]^{\\intercal}\\rbrace\\big)\\\\ \\cup \\ \\textrm{span}\\big(\\lbrace [0\\ 1\\ 0]^{\\intercal}, [0\\ 0\\ 1]^{\\intercal}\\rbrace\\big) \\] The union of two subspaces is much smaller than their $\\textrm{span}$. The set of all $1$-sparse vectors in $\\mathbb R^n$ is the union of the coordinate axes (the lines spanned by the standard basis vectors) of $\\mathbb R^n$, whereas the axes together span the whole space. 
Thus, even when $n\\gg m$, we can intersect this large $r$-dimensional space of solutions with the tiny $s$-dimensional slices to find the special, sparse solutions of $Ax=b$.\nIn the next post , I talk about why we can also swap $\\lVert x\\rVert_0$ out for $\\lVert x\\rVert_1$ in practice, and still recover $x$ uniquely and perfectly in many cases. Minimization of $\\lVert x\\rVert_0$ is a combinatorial problem (which means that the computational effort required to solve it scales exponentially in the dimension of the problem), but minimization of $\\lVert x\\rVert_1$ is a convex optimization problem, which admits efficient, scalable algorithms for solving it.\nUse the rank-nullity theorem, and the fact that $\\textrm{Rank}(A)\\leq \\textrm{min}\\\\;\\lbrace m,n\\rbrace$ for $A\\in\\mathbb R^{m\\times n}$.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nIn addition, $A$ needs to satisfy one of certain properties, such as the restricted isometry property. It essentially ensures that the measurements are somewhat orthogonal to each other, i.e., that we aren\u0026rsquo;t wasting the few measurements we do have by making redundant measurements.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/sparsity/","summary":"The so called curse of dimensionality in machine learning is the observation that neural networks with many parameters can be impossibly difficult to train due to the vastness of its parameter space. This is because many (if not all) of the problems we\u0026rsquo;re interested in solving as engineers have some inherent sparsity.","title":"Sparsity"},{"content":"Let $\\mathcal X$ be a Hilbert space, which means that it is a vector space that has an inner product (denoted by $\\langle \\cdot, \\cdot\\rangle _\\mathcal X$) and that it is complete, i.e., it doesn\u0026rsquo;t have er\u0026hellip; holes in it. Recall that inner product spaces have a rich geometric structure, and so do Hilbert spaces. 
The Euclidean space $\\mathbb R^n$ is an obvious example, where the inner product is just the dot product. Mathematicians sometimes use \u0026lsquo;Hilbert space\u0026rsquo; to refer specifically to infinite-dimensional inner product spaces, but for our purposes, we will let \u0026lsquo;Hilbert space\u0026rsquo; include the finite-dimensional case.\nSome interesting Hilbert spaces are given below. Each of these spaces has a corresponding norm, $\\lVert x \\rVert_{\\mathcal X}=\\sqrt{\\langle x, x\\rangle_{\\mathcal X}}$, which we call the norm induced by the inner product.\nHilbert Space $\\mathcal X$ | Inner Product $\\langle \\cdot, \\cdot\\rangle _\\mathcal X$\nNumbers $a$, $b\\in \\mathbb R$ | $ab$\nRandom Variables $X,Y\\in \\mathbb R$ | $\\mathbb E \\left[XY \\right]$\nReal Vectors $x$, $y\\in \\mathbb R^n$ | $x^\\intercal y$\nComplex Vectors $x$, $y\\in \\mathbb C^n$ | $x^\\dagger y$\nMatrices $\\mathbf A$, $\\mathbf B\\in \\mathbb R^{m\\times n}$ | $\\text{Trace}(\\mathbf A^{\\intercal}\\mathbf B)$\nSequences $(x_i)_{i=1}^{\\infty}, (y_i)_{i=1}^{\\infty}\\in \\ell ^2(\\mathbb R)$ | $\\sum_{i=1}^{\\infty} x_i y_i$\nSquare-Integrable Functions $f,g \\in L^2(\\mathbb R)$ | $\\int_{-\\infty}^{\\infty}f(x)\\overline{g(x)}dx$\nHere, $\\overline{({}\\cdot{})}$ is the conjugate of a complex number (where you replace $i$ with $-i$), and $({}\\cdot{})^\\dagger =\\overline{({}\\cdot{})}^\\intercal $ is called the conjugate transpose of a complex vector.\nProjections A defining feature of Hilbert Spaces is their geometry. 
Because they have notions of angles, we have\n$$ 0 \\leq {|\\langle x, y\\rangle_{\\mathcal X} |} \\leq {\\|x\\|\\|y\\|} $$where if the first equality holds we say $x$ and $y$ are orthogonal to each other, and the second equality holds if and only if they are linearly dependent: $x = a y$ for $a \\in \\mathbb R$ (or $\\mathbb C$, when $\\mathcal X$ is a complex vector space).\nWe can define the projection of an element $x\\in \\mathcal X$ onto a closed subspace $S\\subseteq \\mathcal X$:\n\\[ \\text{P}_S(x) = \\mathrm{argmin}_{y\\in S}\\|x-y\\|_{\\mathcal X}\\] where $\\lVert v \\rVert_{\\mathcal X} = \\sqrt{\\langle v, v\\rangle_{\\mathcal X}}$ is the induced norm. A remarkable by-product of these definitions of angles, orthogonality, and projection is that they are consistent with our Euclidean intuition: $x-\\text{P}_S(x)$ is always orthogonal to $S$. A related theorem says that $\\text P_{\\tilde S}(x)$ is well-defined and unique even if $\\tilde S \\subseteq \\mathcal X$ is a closed convex set, although in this case we do not have orthogonality of $x-\\text{P}_{\\tilde S}(x)$ to the other elements in $\\tilde S$.\nAs an example, given two \u0026lsquo;vectors\u0026rsquo; $X$ and $Y$ in the Hilbert space of random variables, orthogonality corresponds to $\\mathbb E\\left[X Y\\right] = 0$ and the squared norm $\\mathbb E\\left[X^2\\right]$ becomes the variance when $X$ has zero mean. In this case, orthogonal projection actually gives the least-squares estimator of one random variable given another (Lec. 22, p. 85). More generally, a projection can be used to compute the best approximation of an element of a Hilbert space (namely, the point being projected), with the approximation constrained to lie on a subspace (or a closed convex set).\nThe $(\\mathbb R^N, \\lVert{}\\cdot{}\\rVert_2)$ Space of Sequences Given a vector in the Euclidean space $\\mathbb R^N$ (equipped with a basis), we can view it as a set of $N$ coefficients. 
Each coefficient (a real number) describes the vector\u0026rsquo;s distance along the corresponding basis vector (or \u0026lsquo;axis\u0026rsquo;, if you will).\nTaking a few steps back, let\u0026rsquo;s begin by viewing $\\mathbb R^N$ as the set-theoretic product $\\mathbb R \\times \\mathbb R \\times \\dots \\times \\mathbb R$. An element $x$ of this space is a sequence of real numbers,\n\\[x = (x_1, x_2, \\dots, x_N).\\] Once we define notions of addition and scalar multiplication for such $N$-coefficient sequences, we are allowed to call $\\mathbb R^N$ a vector space. A norm for $\\mathbb R^N$ can be defined as $\\left(\\sum_{i=1}^N |x_i|^2\\right)^{1/2}$, which automatically induces a topology (notion of open and closed sets) on $\\mathbb R^N$. We can generalize all of this as $N\\rightarrow \\infty$.\nThe $\\ell^2$ Space of Sequences $\\ell^2(\\mathbb R)$ consists of certain countable sequences of real numbers. \u0026lsquo;Countable\u0026rsquo; here means that we can count their entries like we can count the natural numbers, but they are infinitely long nonetheless. We denote one such sequence as $(x_i)_{i=1}^{\\infty}$. A sequence is in $\\ell^2$ if and only if it is square-summable, which means\n\\[ \\sum_{i=1}^{\\infty}|x_i|^2 \u003c \\infty \\]\nA sequence which is not in $\\ell^2$ (i.e., an \u0026lsquo;infinite-length vector\u0026rsquo;) is $\\left(\\frac{1}{\\sqrt{1}}, \\frac{1}{\\sqrt{2}}, \\frac{1}{\\sqrt{3}}, \\dots\\right)$, because its terms decay too slowly: its sum of squares is the harmonic series $\\sum_{i=1}^{\\infty} \\frac{1}{i}$, which diverges. A sequence that is in $\\ell ^2$ is $\\left(\\frac{1}{1}, \\frac{1}{2}, \\frac{1}{3}, \\dots\\right)$, whose sum of squares converges to $\\frac{\\pi^2}{6}$.\n(Separable) Hilbert Spaces are Isomorphic Isomorphisms are maps from one type of mathematical object to another that preserve its structure. 
All (separable) infinite-dimensional Hilbert spaces are isomorphic to the $\\ell^2$ space , which is a fancy way of saying that we can do the following: We first construct a countable orthonormal basis for $\\mathcal X$ (e.g., using the Gram-Schmidt process and perhaps Zorn\u0026rsquo;s lemma ), denoted as $(e_i)_{i=1}^{\\infty}$. Then, we define an isometric isomorphism (a distance-preserving, structure-preserving mapping) from $\\mathcal X$ to $\\ell^2$, as follows:\n\\[T: \\mathcal X \\rightarrow \\ell^2\\] \\[T(x) = (\\langle e_i,x \\rangle_{\\mathcal X})_{i=1}^{\\infty} \\] which is the sequence of coefficients of $x$ along each basis vector. Note that $T$ is basis-dependent, we could pick a different basis and get a different $T$. Since $T$ is an isomorphism (i.e., does not \u0026lsquo;destroy\u0026rsquo; information during the mapping), we may hope to be able to go back from $\\ell^2$ to $\\mathcal X$:\n\\[T^*: \\ell^2 \\rightarrow \\mathcal X\\] \\[T^{*}\\left((c_i)_{i=1}^{\\infty}\\right) = \\sum_{i=1}^{\\infty} c_i e_i \\] Here, $T^{*}$ is called the adjoint of $T$; it is an operator that satisfies\n\\[\\langle y,T(x)\\rangle _{\\ell^2} = \\langle T^*(y),x\\rangle _{\\mathcal X}\\]\nThe map $T({}\\cdot{})$ is necessarily linear , and it can be shown that $T^*$ is the inverse of $T$ (i.e., $T^\\ast = T^{-1}$, something which is not true for general linear transformations!). The mapping $T({}\\cdot{})$ can be represented via a matrix, although this matrix has infinitely many rows and columns.\nUsing a similar reasoning as above in the finite-dimensional case, we see that all (separable) complex Hilbert spaces with dimension $N$ are isomorphic to the $(\\mathbb C^N, \\lVert{}\\cdot{}\\rVert_2)$ space. The condition that $T^*=T^{-1}$ might remind you of unitary matrices ($U^\\dagger = U^{-1}$, where $U^\\dagger$ is the conjugate transpose of $U$), which are exactly the distance-preserving, structure-preserving matrices in $\\mathbb C^N$. 
When we\u0026rsquo;re working with real Hilbert spaces we replace $\\mathbb C^N$ with $\\mathbb R^N$ and \u0026lsquo;unitary\u0026rsquo; with \u0026lsquo;orthogonal\u0026rsquo;. Recall that $Q^\\intercal = Q^{-1}$ for an orthogonal matrix $Q$. Orthogonal matrices are an isometry (distance-preserving) because $\\lVert Qx\\rVert_2 = \\lVert x \\rVert_2$, and they are an isomorphism (structure-preserving) because $\\langle Qx, Qy \\rangle =\\langle x, Q^\\intercal Qy \\rangle = \\langle x, y\\rangle$. Note that isomorphisms also preserve the angles between vectors.\nThe $L^p$ Space of Functions Let\u0026rsquo;s move onwards to function spaces. The space $L^1(\\mathbb R)$ is the space of absolutely integrable functions. An element $f\\in L^1(\\mathbb R)$ is a function of the form $f:\\mathbb R\\rightarrow \\mathbb C$ and satisfies\n$$ \\| f\\|_{L^1} =\\int_{\\mathbb R} | f(x)| dx \u003c \\infty.$$It is clear how to add and (pointwise) multiply functions, so $L^1(\\mathbb R)$ is indeed a vector space. The space $L^2(\\mathbb R)$ consists of functions which are square-integrable:\n$$ \\| f\\|_{L^2} =\\left(\\int_{\\mathbb R} | f(x)|^2dx\\right)^{1/2} \u003c \\infty$$and is a Hilbert space because it has the following inner product + norm combination:\n$$ \\begin{align} \\langle f, g \\rangle_{L^2} \u0026= \\int_{\\mathbb R} f(x) \\overline{g(x)} dx\\\\ \\| f\\|_{L^2} \u0026= \\sqrt{ \\langle f, f \\rangle_{L^2}}. \\end{align}$$Can we use the above as an inner product for $L^1$ as well? It turns out that we can\u0026rsquo;t, because even if $f$ is in $L^1$, the integral of $f(x)\\overline{f(x)}$ can be unbounded if $f$ is not also in $L^2$.\nObserve that we always require some sort of 'boundedness of norm' when defining Hilbert spaces... it is related to the requirement of completeness of the space, which we have conveniently glossed over in this article. 
The norms on $\\ell^p$ and $L^p$ spaces are natural extensions of the $p$-norms for finite-dimensional vector spaces, and as expected, we have an inner product only when $p=2$. The $L^p(\\mathbb R)$ space can be thought of as a \u0026lsquo;refinement\u0026rsquo; of the domain of the $\\ell^p$ space1, where the index set $\\lbrace 1, 2, 3, \\dots\\rbrace$ (which was countable) of $\\ell^p$ is replaced with $\\mathbb R$ (which is uncountable) in $L^p$. So even though these might seem like completely different concepts thrown together, they are closely related and inherit much of each other\u0026rsquo;s properties!\nSince $L^2$ and $\\ell^2$ are supposed to be isomorphic, this suggests that we can go from $L^2$ to $\\ell^2$ (and back) using a linear map. The fact that we can represent an arbitrary function on a continuous domain using a sequence of numbers is remarkable.\nFourier Transforms This section is for people who might have encountered the Fourier transform before, and want to see the Hilbert space interpretation of it. Consider a function $f\\in L^2([0,1])$, such that $f(t)$ represents the amplitude of a signal at time $t$. The sinusoid is one such signal: $f(t)=\\sin (t)$. An orthonormal basis for $L^2([0, 1])$ is $(e_k)_{k\\in \\mathbb Z}$, where\n$$e_k(t)=e^{2 \\pi i k{t}}.$$Here, $\\mathbb Z$ are the integers, which is still a countable set because we can count them as $(0, 1, -1, 2, -2, \\dots)$. Each $e_k$ has unit norm because $\\int_0^1 |e^{2\\pi i k t}|^2 dt = 1$. In this case, our isomorphism $T$ from $L^2([0,1])$ to $\\ell^2$ is given by:\n$$\\begin{align} T(f) \u0026= \\left(\\langle f, e_k \\rangle \\right)_{k\\in \\mathbb Z}\\\\ \u0026= \\left( \\int_{0}^1 f(t)e^{-2\\pi i kt} dt\\right)_{k\\in \\mathbb Z} \\end{align}$$These are exactly the Fourier coefficients (up to a constant factor, depending on how you define them)! Each coefficient tells you how much of a certain frequency is present in a signal. 
The way we have defined them (with proper normalization of the basis vectors) ensures that the Fourier transform $T$ is an isometric isomorphism. In fact, the observation that\n$$\\langle f, g\\rangle_{L^2([0,1])} = \\langle T(f), T(g)\\rangle_{\\ell^2}$$has a special name in the signal processing community: it\u0026rsquo;s called Parseval\u0026rsquo;s (or Plancherel\u0026rsquo;s) theorem. We can also discard some of the Fourier coefficients to compress (or de-noise) a signal by ignoring its weak (or bothersome) frequencies \u0026ndash; be it an audio signal or an image. Such a truncation of the sequence is a special case of the projection operation in Hilbert spaces, so it\u0026rsquo;s the best approximation in terms of the $L^2$ norm of the approximation error.\nOne of the motivations for defining the map $T$ is that we can now represent objects in an arbitrary Hilbert space using a (countable) set of numbers (in $\\ell^2$). Aside from being a powerful theoretical tool, it lets us store and manipulate these objects on computers efficiently, as evidenced by the example of the Fourier transform.\nAll that said, the reason I love typing out posts like these is because it\u0026rsquo;s so gratifying to see all of these different mathematical objects be unified under a single concept. The interplay between vectors, sequences, and functions is something that was never taught or emphasized to me in school. All throughout college, my instructors usually pulled the Fourier transform out of their hat, just to use it for 2 lectures and then put it back in before I ever figured out what it was. 
Maybe I\u0026rsquo;m just a slow learner, so it\u0026rsquo;s a good thing I have (one would hope) a long life ahead of me to keep learning!\nWe could also view it as a net .\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/hilbert-spaces/","summary":"A Hilbert space is a vector space that has an inner product and that it is complete, i.e., it doesn\u0026rsquo;t have holes in it. Inner product spaces have a rich geometric structure, and so do Hilbert spaces. The Euclidean space is an obvious example, where the inner product is just the dot product.","title":"Hilbert Spaces"},{"content":"Let\u0026rsquo;s look at the norm balls corresponding to the different $p$-norms in $\\mathbb R^n$, where $n$ is the dimension of the space. For a vector $x\\in \\mathbb R^n$, the $p$-norm is\n\\[ \\|x\\|_p \\coloneqq \\left(\\sum_i |x_i|^p\\right)^{\\frac{1}{p}} \\] When $p=2$ this is the usual Euclidean norm, i.e., the distance of $x$ from the origin. The corresponding ball is what we think of when someone says ball; it is all the points that are within a given distance from the origin. More generally, the ball of \u0026lsquo;radius\u0026rsquo; $r$ corresponding to a norm $\\lVert{}\\cdot{}\\rVert$ is\n\\[\\lbrace x\\ |\\ x\\in\\mathbb R^n, \\|x\\|\\leq r\\rbrace\\] Let\u0026rsquo;s call the ball corresponding to $\\lVert{}\\cdot{}\\rVert_p$ the $p$-ball. If $r=1$, we will call it a unit ball. This website shows the unit balls corresponding to other $p$-norms. Here is my artistic illustration of the same:\nThe figure may indicate that $\\lVert{}\\cdot{}\\rVert_p$ gets bigger as $p$ increases, but it\u0026rsquo;s the other way around:\n\\[\\|x\\|_1 \\geq \\|x\\|_2 \\geq \\dots \\geq \\|x\\|_\\infty\\]\nbecause you need to go further from the origin to get to the boundary of the $\\infty$-ball. This is similar to how a car with a poorer mileage would need to expend more fuel to get to the same point. 
So the balls get bigger, but the \u0026lsquo;mileage\u0026rsquo; gets smaller.\nIf $p\u003c1$, then the corresponding \u0026ldquo;$p$-norm\u0026rdquo; is not actually a norm, as it is guaranteed to violate the conditions which we usually place on a norm. What feature do we see appearing in the $p$-balls, when $p\u003c1$? The answer is that they curve inwards (are non-convex). In particular, the $0$-ball is quite bizarre, it is exactly the axes! ( Recall that the \u0026ldquo;$0$-norm\u0026rdquo; counts the number of non-zero elements in a vector, which is $1$ for each point on the axes). Let\u0026rsquo;s not talk about the $0$-ball for now.\nThe $2$-Ball When $n=2$ the $2$-ball is a circle, and when $n=3$ the $2$-ball is a sphere. What\u0026rsquo;s less obvious is the case of $n=1$, in which case each of the $p$-norm unit balls is just the line segment from $-1$ to $1$.\nWe may or may not remember from high-school physics that spheres (i.e., $2$-balls) minimize the ratio of the surface area to volume of a shape. It is why bubbles take spherical shapes, in order to minimize their potential energy. A defining feature of the $2$-ball is that it has continuous rotational symmetry \u0026ndash; we can rotate the $2$-ball without changing its shape, something we cannot do for any of the other $p$-balls.\nThe $\\infty$-Ball Recall that the $\\infty$-norm is evaluated by taking the limit $p\\rightarrow \\infty$ in the definition of the $p$-norm, giving\n\\[ \\|x\\|_\\infty=\\max\\left\\lbrace|x_1|, |x_2|, \\dots, |x_n|\\right\\rbrace \\] The $\\lVert{}\\cdot{}\\rVert_\\infty$ unit ball is always a cube in each dimension, it can be treated as the definition of a unit cube. As such, it has $2n$ faces in dimension $n$. 
This is because\n\\[ \\max\\left\\lbrace|x_1|, |x_2|, \\dots, |x_n|\\right\\rbrace \\leq 1 \\] \\[ \\Updownarrow \\] \\[ |x_1|\\leq 1\\quad \\text{and}\\quad |x_2|\\leq 1\\quad \\text{and}\\quad \\dots\\quad |x_n|\\leq 1 \\] \\[ \\Updownarrow \\] \\[ x_1\\leq 1\\quad \\text{and}\\quad -x_1\\leq 1\\quad \\dots\\quad x_n\\leq 1 \\quad \\text{and}\\quad -x_n\\leq 1 \\] which gives us $2n$ linear constraints, with each linear constraint adding a face.\nThe $1$-Ball The $1$-Ball in $2$ dimensions seems like a rotated square, or a diamond, and it appears like this should generalize. Let\u0026rsquo;s count its faces.\n\\[ |x_1| + |x_2| + \\dots + |x_n| \\leq 1 \\quad \\Leftrightarrow \\quad \\pm x_1 \\pm x_2 \\pm \\dots \\pm x_n \\leq 1 \\] where each \u0026lsquo;$\\pm$\u0026rsquo; indicates that we can choose either sign to obtain a different inequality. The total number of linear inequalities that we can construct is $2^n$. So $1$-balls have many, many more faces than $\\infty$-balls in higher dimensions!\nThis observation has consequences in the field of optimization; some poor bloke here is trying to check whether a given vector $x$ lies inside the $1$-ball. They write this out as a system of linear inequalities, only to realize that it would take an exponentially large number of linear inequalities to do this.\nEquivalence of Norms Given any two numbers $p_1$ and $p_2$ between 0 and $\\infty$, with $1\\leq p_1,p_2\\leq \\infty$, we can place $p_1$-balls inside and outside a given $p_2$-ball so that they touch. Two examples of this are as follows:\nThis is what is meant when we say that the $p$-norms are equivalent. Vectors are big (or small) in one norm if and only if they\u0026rsquo;re big (or small) in the other. \u0026lsquo;Big\u0026rsquo; and \u0026lsquo;small\u0026rsquo; are used very subjectively here.\nHigher Dimensions We can use everything we just introduced to show interesting quirks of high-dimensional geometry. 
While this is an entertaining discussion in its own right, it has important consequences in fields like deep learning and probability theory.\nThe Construction: Consider placing a cube of side length $2$ centred at the origin, as well as a green cube with side length $4$. Then place a sphere at each of the corners of the (inner) cube so that they touch each other, as follows:\nwhere by \u0026lsquo;sphere\u0026rsquo; and \u0026lsquo;cube\u0026rsquo; we really mean hypersphere and hypercube, which are their corresponding higher dimensional analogues. Finally, we try to squeeze in a new orange sphere at the centre. The entire configuration fits snugly inside the bigger green cube, and the maroon/pink spheres are in some sense \u0026rsquo;tightly packed\u0026rsquo; inside the green cube. The figure above is for dimension $2$.\nSome Observations: Let\u0026rsquo;s say we are in some higher dimension, $n$. If this feels like unfamiliar territory, we can think of these as $2$-balls and $\\infty$-balls, so we have all the machinery needed to be able to think about these objects at some capacity. For example, we know the following:\nFrom the origin, we can travel exactly $1$ unit along any axis to reach the inner cube The radii of the spheres at the corners is $1$ The corners of the inner cube are $\\sqrt n$ away from the origin, in terms of the Euclidean distance (or by a repeated application of the Pythagoras theorem ) When $n\\gg1$, the corners of the inner cube are extremely ($\\sqrt n$) far from the origin, whereas its faces are still $1$ unit away. People often phrase this as \u0026ldquo;higher dimensional cubes are spiky\u0026rdquo;. Here are some weirder facts:\nThe spheres at the corners still have radius $1$ There are $2^n$ spheres at the corners, because there are $2^n$ corners for the cube The orange sphere has a radius of $\\sqrt n -1$ Thus, in $1$ dimension, the orange sphere has radius $0$. In $3$ dimensions it has radius $\\sqrt 3 -1$. 
In $4$ dimensions it has radius\u0026hellip; $1$? It\u0026rsquo;s the same size as the spheres at the corners\u0026hellip; and it touches the faces of the inner cube\u0026hellip;\nIn $10$ dimensions, the sphere in the middle sticks out of the outer (green) cube, since its radius $\\sqrt{10}-1\\approx 2.16$ exceeds the cube\u0026rsquo;s half-width of $2$! Because the corners of the cube have moved far from the origin, so have the unit spheres we placed at them, making more room for the orange sphere to expand with the dimension.\nGoing Inwards Let\u0026rsquo;s now take the following construction, where we inscribe a purple sphere inside the inner cube. These are precisely the unit balls of the $2$ and $\\infty$ norms:\nWe saw that the $\\infty$-ball always touches the $2$-ball from the outside, so even if we turn up the dimension, the sphere had better remain contained within the cube. But the corners are still moving away from the origin, so the cube seems to be getting comparatively larger. Let $\\text{Vol}({}\\cdot{})$ denote the volume (given by the Lebesgue measure) of a set.\n\\[\\frac{\\text{Vol}\\left(\\textit{\\ Unit Sphere in }\\mathbb R^n\\ \\right)}{\\text{Vol}\\left(\\textit{\\ Unit Cube in }\\mathbb R^n\\ \\right)}\\ \\rightarrow\\ 0\\qquad \\text{as}\\ n\\rightarrow \\infty \\] The higher dimensional sphere is insignificantly small in comparison to the cube. Actually, our earlier construction was also saying the same thing. The purple sphere can be thought of as one of the spheres we placed at the corners of the inner cube, each of which vanishes as $n\\rightarrow \\infty$!1\nThere\u0026rsquo;s a variety of ways of showing this, though it\u0026rsquo;s less straightforward than the stuff we did so far. My favorite proof of this (or rather, a comparable) result is from example $2.2.5$ of Rick Durrett\u0026rsquo;s Probability: Theory and Examples, where he shows it using the law of large numbers! 
The reason it stands out in my memory is not only because of the absurdity of showing a fact of geometry using something so seemingly far removed from geometry, but because he prefaces the result with the sentence, \u0026ldquo;Our next result is for comic relief.\u0026rdquo; The probabilistic argument shows that almost all of the volume of the cube is concentrated within some thin shell, somewhere between its center and its corners. I know, it\u0026rsquo;s bizarre.\nAn Intuition We can reuse my analogy about the \u0026lsquo;mileage\u0026rsquo; for norms. In higher dimensions, the $2$-norm has a lot of mileage in the \u0026lsquo;$45^{\\circ}$\u0026rsquo; direction, so we don\u0026rsquo;t need to go that far out to reach its unit ball. The $\\infty$-ball actually has barely any mileage in this direction, because we need to go $\\sqrt{n}$ far to reach its ball. Another useful way to think of this is to consider the sequence of real numbers (which is really what $\\mathbb R^n$ is),\n\\[(1, 1, 1, \\dots, 1)\\] Think about its sum ($1$-norm) vs. its maximum element ($\\infty$-norm), and remember that \u0026lsquo;mileage\u0026rsquo; of a norm corresponds inversely to the size of the norm ball.\nThe Curse of Dimensionality A quirk that shows up repeatedly in deep learning in various forms is the so-called curse of dimensionality. It usually refers to one of several things, but one of these is the vanishing of the higher dimensional sphere. Think of each dimension in the preceding discussion as the parameter of a neural network. The ambient space (where the purple sphere lives) is like the parameter space of the neural network; it is the set of all possible combinations of parameters. 
Searching for the right combination of parameters (the sphere) in the much larger parameter space (the cube) becomes more and more futile as the number of parameters (the dimension) increases.\nOn the other hand, certain quirks of higher dimensional geometry constitute what some researchers call the blessing of dimensionality. Of the different things it can refer to, one of the observations is that as the dimension increases, random sampling from a high-dimensional vector space becomes more and more well-behaved. The sampled vectors are increasingly likely to be nearly orthogonal, and have a somewhat predictable length (due to Durrett\u0026rsquo;s \u0026lsquo;volume concentration\u0026rsquo; example). It can be used to facilitate, rather than hinder, high-dimensional computational tasks, so long as one knows what to watch out for.\nAt the end of this note the author places spheres at the corners of the $1$-ball instead. I don\u0026rsquo;t know enough topology to see if the \u0026lsquo;paradox\u0026rsquo; that arises in this case can be explained without introducing additional concepts.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/balls/","summary":"Let\u0026rsquo;s look at the norm balls corresponding to the different p-norms. When p equals 2 this is the usual Euclidean distance. The corresponding ball is what we think of when someone says \u0026lsquo;ball\u0026rsquo;, it is all the points that are within a given distance from the origin.","title":"Norm Balls"},{"content":"To quote this math podcast, \u0026ldquo;the real world is a special case\u0026rdquo;. I mentioned in the last post that Euclidean geometry arises by taking $\\mathbb R^2$ or $\\mathbb R^3$ and endowing it with an inner product, at which point it satisfies the Pythagoras theorem. In this post I will talk about how the Pythagoras theorem is a special case of a more general feature of inner product spaces. 
The contents of the last post are prerequisites for this one.\nThe Parallelogram Law Let $x$ and $y$ be two vectors in a normed vector space that we are interested in. Recall that the existence of an inner product $\\langle x, y\\rangle$ implies the existence of a corresponding norm, $\\lVert x\\rVert = \\sqrt{\\langle x, x\\rangle}$. But the converse direction is not always true. It is true precisely when the normed vector space obeys the parallelogram law: for all vectors $x$ and $y$,\n\\[2\\|x\\|^2 + 2\\|y\\|^2 = \\|x+y\\|^2 + \\|x-y\\|^2 \\] The name of this law comes from the special case of $\\mathbb R^2$ shown above, where it is a relationship between the side lengths and diagonals of a parallelogram. Notably, if $\\lVert x+y \\rVert=\\lVert x-y \\rVert$, i.e., the parallelogram is a rectangle, then we recover the Pythagoras theorem. Thus, the Pythagoras theorem is a corollary (i.e., a byproduct) of the fact that $\\mathbb R^2$ equipped with the Euclidean norm $\\lVert{}\\cdot{}\\rVert_2$ satisfies the parallelogram law.\nNext, let\u0026rsquo;s see why the validity of the parallelogram law coincides with the existence of an inner product.\nSymmetric Bilinear Forms A symmetric bilinear form is a map $\\phi(x,y)$ that takes two vectors $x$ and $y$ of a vector space and gives a real number1, much like an inner product or a metric does. A symmetric bilinear form is symmetric\n\\[ \\phi(x, y) = \\phi(y,x) \\] and bilinear\n\\[ \\phi(x+y, z+w) = \\phi(x,z) + \\phi(x,w) + \\phi(y,z) + \\phi(y,w) \\] which, together with compatibility with scalar multiplication, $\\phi(\\alpha x, y) = \\alpha\\,\\phi(x,y)$, means that it is linear in either argument. Inner products on a real vector space are required to be positive-definite symmetric bilinear forms. 
Positive definite means that $\\phi(x,x)\\geq 0$ and $\\phi(x,x)=0$ $\\Leftrightarrow$ $x=0$.\nNow, if we set $z=x$ and $w=y$ in the above expression, and using the positive-definite, symmetric, and bilinear properties of inner products, we get\n\\[ \\langle x+y, x+y \\rangle = \\langle x,x \\rangle + \\langle x,y \\rangle + \\langle y,x \\rangle + \\langle y,y \\rangle \\] \\[ \\| x+y\\|^2 = \\| x \\|^2 + 2 \\langle x,y \\rangle + \\| y\\|^2 \\] where we used the notation, $\\lVert x\\rVert = \\sqrt{\\langle x, x\\rangle}$. As $(-y)$ is also an element of our vector space, we can repeat the same steps to get\n\\[ \\| x-y\\|^2 = \\| x \\|^2 - 2 \\langle x,y \\rangle + \\| y\\|^2 \\] The sum of the last two equations is the parallelogram law, whereas subtracting the second equation from the first gives us\n\\[ \\langle x,y \\rangle = \\frac{\\|x+y\\|^2 - \\|x-y\\|^2}{4} \\] Observe that we can use the preceding equation as a definition for the inner product in terms of the underlying norm. Thus, normed vector spaces satisfying the parallelogram law have a unique inner product, which is defined as above. What remains to be shown is that this definition of an inner product using a norm, combined with the parallelogram law (which our norm supposedly satisfies), indeed satisfies all of the requirements that the inner product should .\nSpecial Cases Suppose the normed space we were working with was $\\mathbb R^n$ with the $2$-norm, $\\lVert{}\\cdot{}\\rVert_2$, then as one would expect, the unique inner product we get is the dot product for finite-dimensional vectors $x$ and $y$, which we usually write as $x^Ty$ or $x\\cdot y$ in place of the more general notation of $\\langle x, y \\rangle$. Other $p$-norms do not satisfy the parallelogram law, and hence do not have an associated inner product.\nAs we saw, specializing the underlying vector space to $\\mathbb R^2$ makes the parallelogram law a relationship between the sides and diagonals of a parallelogram. 
Further specializing to the case where $x$ and $y$ make an angle of $90^\\circ$ between each other, i.e., $\\langle x,y\\rangle = 0$, yields the Pythagoras theorem.\nFinally, in $\\mathbb R$, the law takes its most plausible form:\n\\[ (x+y)^2 + (x-y)^2 = 2x^2 + 2y^2 \\] Update: Someone on Mathstodon pointed out to me that what the parallelogram law is really saying is that the norm-squared function $f(x)=\\lVert x\\rVert^2$ is a degree $2$ polynomial. Let\u0026rsquo;s explore this real quick.\nNotice that a degree $2$ polynomial is characterized by the fact that its second derivative is constant everywhere. Suppose, this constant (which is the Hessian) is $c\\cdot\\mathbf I$, where $c$ is some number and $\\mathbf I$ is the identity matrix. Let\u0026rsquo;s take the Taylor series expansion of $f$ at $x$, sticking to the Euclidean space $\\mathbb R^n$ for simplicity.\n\\[ f(x+y) = f(x) + f'(x)^T y + \\frac{1}{2} y^Tf''(x)y \\] \\[ \\qquad \\ = f(x) + f'(x)^T y + \\frac{c}{2} f(y) \\] Similarly,\n\\[ f(x-y) = f(x) - f'(x)^T y + \\frac{c}{2} f(y) \\] Adding these,\n\\[ f(x+y) + f(x-y) = 2 f(x) + c f(y) \\] Naturally, we set $c=2$. Thus, we could potentially simplify the parallelogram law to: The norm-squared function is a polynomial of degree $2$, which sounds more fundamental and less arbitrary than the parallelogram law to me, let alone the Pythagoras theorem. But we need to do more work to generalize this math to hold outside of Euclidean spaces.\n\u0026hellip; more generally, an element of the field of that vector space, which is $\\mathbb C$ for complex vector spaces, and so on. We can generalize the above discussion to vector spaces over other fields if we were so inclined.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/pythagoras/","summary":"I mentioned in the last post that Euclidean geometry arises by taking the real numbers and endowing it with an inner product, at which point it satisfies the Pythagoras theorem. 
In this post I will talk about how the Pythagoras theorem is a special case of a more general feature of inner product spaces.","title":"The Parallelogram Law"},{"content":"This is an explainer on norms, metrics, and inner products, and their relationships to each other.\nNorms A norm is any real-valued function $\\lVert{}\\cdot{}\\rVert$ (taking the elements of a corresponding vector space as its arguments), which has the following properties:\nIt is nonnegative, and $0$ only at the \u0026lsquo;zero element\u0026rsquo; (for e.g., at the origin of $\\mathbb R^n$).\n$\\lVert \\alpha x \\rVert = |\\alpha| \\lVert x \\rVert$ for any scalar $\\alpha$.\nIt satisfies the triangle inequality,\n\\[ \\|x+y\\|\\leq \\|x\\| + \\|y\\| \\]\nWe start by defining the usual vector $p$-norm:\n\\[ \\|x\\|_p \\coloneqq \\left(\\sum_i |x_i|^p\\right)^{\\frac{1}{p}} \\]\nThe cases of $p=0$ and $\\infty$ are evaluated using limits, and thus take special forms:\n\\[ \\|x\\|_0 = \\sum_i |x_i|^0 \\] \\[ \\|x\\|_\\infty=\\max\\left\\lbrace|x_1|, |x_2|, \\dots, |x_n|\\right\\rbrace \\]\nwhere $0^0$ is defined as $0$. So, $\\lVert x\\rVert_0$ counts the number of non-zero entries in the vector, whereas $\\lVert x\\rVert_\\infty$ picks out the maximum (in magnitude) entry of the vector.\nActually, $\\lVert x\\rVert_0$ isn\u0026rsquo;t really a norm (it fails property no. $2$, for one), but it still gets called the \u0026ldquo;$0$-norm\u0026rdquo; for convenience. I\u0026rsquo;ve seen some authors call it a pseudonorm. The \u0026ldquo;$0$-norm\u0026rdquo; has exploded in popularity due to its applications in the field of compressive sensing.\nMetrics While a norm takes one operand, distances/metrics take two operands. Observe that a norm can be interpreted as \u0026rsquo;the distance from the origin\u0026rsquo;. In fact, all norms give rise to a corresponding distance/metric for that space. 
The $p$-norm gives rise to (or as mathematicians say, induces) the Minkowski distance for $\\mathbb R^n$. Given $x,y\\in\\mathbb R^n$, the Minkowski distance/metric is\n\\[ D_p(x,y) \\coloneqq \\| x-y\\|_p = \\left(\\sum_i |x_i-y_i|^p\\right)^{\\frac{1}{p}} \\] When $p=2$, this is the Euclidean distance, the one we learn early on in our mathematical careers in the form of the Pythagoras theorem. The other interesting cases are when $p$ is $0$, $1$ or $\\infty$, in which cases it\u0026rsquo;s called the Hamming, Manhattan, or Chebyshev distance, respectively. Metrics must satisfy properties that are analogous to those for norms except property no. $2$ (so the \u0026ldquo;$0$-norm\u0026rdquo; actually defines a proper metric even though it\u0026rsquo;s not a norm). In addition, metrics should satisfy the symmetry condition: $D(x,y)=D(y,x)$.\nThe space $\\mathbb R^n$ together with a norm $\\lVert{}\\cdot{}\\rVert_p$ is called a normed space, whereas $\\mathbb R^n$ together with the distance $D_p(\\cdot,{}\\cdot)$ is called a metric space.\nInner Products There is an additional notion of an inner product (which generalizes the \u0026lsquo;dot product\u0026rsquo; for vectors), giving an inner product space. Like metrics, an inner product takes two operands, $x$ and $y$, and gives a real number denoted by $\\langle x,y \\rangle$. It induces a corresponding norm, given by $\\lVert x \\rVert = \\sqrt{\\langle x,x \\rangle}$. Thus, an inner product space is a normed space, and a normed space is a metric space. Neither of the converse directions is true in general.\nInner product spaces are the rarest, or rather, we impose very specific conditions on inner product spaces which make them so rare. This also means that they are endowed with a lot of structure; it takes more words to describe an intricate sculpture than it does to, say, describe a block of stone. Mathematicians like to say that inner product spaces are rich in structure. 
They allow us to define geometric concepts such as angles, which may not be easy to define for general metric spaces. The angle between two vectors $x$ and $y$ can be defined as\n\\[ \\theta = \\arccos\\left(\\frac{\\langle x,y \\rangle}{\\sqrt{\\langle x, x\\rangle \\langle y, y\\rangle}}\\right) \\]\nThis definition works not only in higher dimensional Euclidean spaces, but in any inner product space, including function spaces.\nThe only distance that comes from an inner product is the Euclidean distance ($p$=2), which comes from our most familiar inner product \u0026ndash; the dot product in $\\mathbb R^n$:\n\\[ D_2(x,y) = \\|x-y\\|_2 = \\sqrt{(x-y)\\cdot (x-y)} \\]\nThe dot product induces the $2$-norm which induces the Euclidean distance, so the whole gang is here. This is why we typically call $\\mathbb R^n$ the Euclidean space; since there is only one canonical inner product for it (the dot product), we don\u0026rsquo;t particularly mind restricting ourselves to the $p=2$ case because it feels somewhat natural to the space.\nThus, inner products are the rarest of them all, but also have the most structure. Metrics/distances are on the other extreme, because all Minkowski distances with $0\\leq p\\leq \\infty$ are proper metrics, but only $p=2$ corresponds to an inner product. Norms are sandwiched in between the two in terms of their rarity; \u0026ldquo;$p$-norms\u0026rdquo; are proper norms only when $p \\geq 1$.\n\\[ \\text{Inner Product Spaces} \\subset \\text{Normed Vector Spaces} \\subset \\text{Metric Spaces} \\]\nor in terms of the values for $p$, we can write\n\\[ \\lbrace 2 \\rbrace \\subset [1, \\infty] \\subset [0, \\infty] \\]\nAn interesting observation which always trips me up is that metric spaces don\u0026rsquo;t have to be vector spaces, or vice versa.\nWhat\u0026rsquo;s so special about $2$ anyway? Once again , I am annoyed by the special place that $2$ takes in this hierarchy of spaces. 
It has a host of interesting properties (which I will link in this post if I ever decide to explore them further):\nThe Euclidean norm ($p=2$) is the only norm that comes from an inner product.\nThe unit balls corresponding to the $2$-norm, $\\lbrace x : \\lVert x \\rVert_2\\leq1 \\rbrace$ are spherical.\nFor this reason, the Euclidean norm/distance ($p=2$) has the richest rotational isometry group . This follows from the observation that rotation of a sphere does not \u0026lsquo;change its shape\u0026rsquo;, but rotating a non-spherical object can \u0026lsquo;change its shape\u0026rsquo;. In fact, for $p\\neq2$, the other Minkowski distances do not have any continuous rotational isometries because their unit balls are non-spherical and pointy.\nThe next post gives some more insights. Also see my previous post .\nThe Euclidean space (the space $\\mathbb R^n$ combined with the $2$-norm induced distance) can be uniquely characterized by the fact that the distance-squared function is differentiable everywhere .\nThe last one was fascinating to me because the differentiability fails not only if you change the distance function, but also if you change to a different topological space, like say, a circle! So not only is the Euclidean distance special, the Euclidean space $\\mathbb R^n$ is itself quite special when equipped with the Euclidean distance.\n","permalink":"https://shirazkn.github.io/posts/norms_metrics/","summary":"An explainer on norms, metrics, and inner products, and their relationships to each other.","title":"Norms, Metrics, and Inner Products"},{"content":"The title is a reference to The Unreasonable Effectiveness of Mathematics in the Natural Sciences , a very popular paper by Eugene Wigner which explores how mathematics is unreasonably effective at not only explaining, but also predicting scientific phenomena. 
I had a similar question about the number $2$ which repeatedly shows up in engineering and science, specifically in the form of the $2$-norm of a vector, and seems surprisingly effective at doing what it\u0026rsquo;s supposed to do. I asked my Estimation Theory instructor at Purdue why this was so, and he told me that I ask too many (but good) questions. I have since then accumulated a variety of answers for why the number $2$ is, in some sense, ✨special✨. During our journey through this post and the next, we will visit the central limit theorem, Gaussian distributions, and Euclidean geometry.\n$2$-Norms in Statistical Regression Let me first elaborate on why I think $2$ shows up in engineering more often than it should. The first time I noticed this was while I was being taught least squares regression for the 100th time. Suppose we want to recover some vector $x\\in \\mathbb R^n$, but we are only able to observe (noisy) measurements of it, given by $y=\\Phi x + \\epsilon$, where $\\Phi \\in \\mathbb R^{m\\times n}$ is called the measurement matrix and $\\epsilon \\in \\mathbb R^m$ is some unknown noise vector. Then, we usually try to solve the following least squares problem:\n\\[ \\min_{\\tilde x} \\|y-\\Phi \\tilde x\\|_2 \\] and, well, it usually just works. We recover something that is close to $x$ and has desirable properties. But why don\u0026rsquo;t we ever consider a more general $p$-norm, $\\lVert{}\\cdot{}\\rVert_p$, instead?\n\\[ \\|x\\|_p \\coloneqq \\left(\\sum_i |x_i|^p\\right)^{\\frac{1}{p}} \\]\nWell, we do indeed consider other norms sometimes. The $1$-norm is the next most commonly used, and it is called the \u0026lsquo;absolute deviation\u0026rsquo; of the error, leading to the least absolute deviations estimator. But the odds are, unless you\u0026rsquo;re a statistician you\u0026rsquo;ve never heard of this estimator. Why\u0026rsquo;s that?\nMaybe the answer lies in the central limit theorem (CLT) and the Gaussian distribution. 
The CLT says that whenever a large number of independent random variables are summed, their distribution is approximately Gaussian. The Gaussian distribution indeed has a (weighted version of) the $2$-norm sitting inside its exponent. Suppose the noise vector $\\epsilon$ in our least squares problem was distributed according to a multivariate Gaussian distribution with a zero mean and the covariance matrix $\\Sigma \\in \\mathbb R^{m \\times m}$, then its probability density function is\n\\[f_\\epsilon(\\zeta)=\\frac{1}{\\sqrt{(2 \\pi)^m \\det{(\\Sigma)}}} \\exp\\left(-\\tfrac{1}{2}\\zeta^T\\Sigma^{-1}\\zeta\\right) \\]\nSetting $\\Sigma=\\sigma I$, i.e., assuming that the error vector is isotropic (has identical statistical properties in every direction), gives us in the exponent $\\zeta^T \\zeta = \\lVert \\zeta\\rVert_2^2$. When we want to obtain a maximum likelihood estimate of $x$, maximizing a function such as $f_\\epsilon(\\zeta)$ amounts to minimizing the term in the exponent, which is $\\lVert \\zeta\\rVert_2^2$. There it is again, the mysterious least squares, now formulated as a maximum likelihood estimation problem!\nOf course, this is because we assumed $\\epsilon$ was Gaussian. If we had instead assumed $\\epsilon$ to have a multivariate Laplace distribution, then we would encounter the 1-norm. The 1-norm has some advantages such as being robust against outliers in the data, as well as being better suited for high-dimensional regression problems. There are both geometric and probabilistic ways of comparing the 1-norm (least absolute deviations) with the 2-norm (least squares). The geometric way looks at the effects of the 1 and 2-norms on the data, whereas the probabilistic way contrasts the assumptions of Laplace vs. Gaussian noise.\nBut we mentioned that the CLT is on team Gaussian. It makes the remarkably universal claim that Gaussian noise is in fact the assumption we want to make. 
If we can figure out what\u0026rsquo;s so special about Gaussians, then we would know exactly the conditions under which we can expect the $2$-norm to emerge as the reigning champion over other kinds of norms/metrics.\nThe Effectiveness of Gaussians Argument using Convolutions For this section, let\u0026rsquo;s only consider scalar-valued random variables. Gaussian distributions have some neat properties which can help explain their \u0026lsquo;central\u0026rsquo; role in the CLT:\nConvolutions and products of two Gaussians, as well as the Fourier transform of a Gaussian, are again Gaussian (up to a scaling factor).\nIn particular, when we sum two independent random variables, the distribution of the sum is given by a convolution (which for our purposes is just some operation that takes two functions and gives another) of the individual distributions. The Gaussian distribution is essentially a fixed point of this iteration, so other distributions (with finite variance) tend to it. This is similar to how if you take a calculator, enter some positive number, and then mash the \u0026lsquo;$\\sqrt{\\ \\ }$\u0026rsquo; button, then you eventually get stuck on the number $1$. This is (mostly) because $1$ is a fixed point of your iteration, $\\sqrt{1} = 1$. Similarly, the (properly scaled) sum of $n$ random variables tends to a Gaussian random variable as $n\\rightarrow \\infty$, due to it being the fixed point of the convolution operation. This is a partial justification/intuition for why the sum of a large number of random variables has a Gaussian distribution \u0026ndash; the CLT.\nThe CLT works largely irrespective of what distributions these individual random variables have (so long as their variances are finite); they can even be different. It is a statistical sledgehammer that works in a wide range of settings (much like its close cousin, the law of large numbers). 
For this reason, researchers and engineers often assume that the noise $\\epsilon$ of our statistical regression problem is Gaussian; when we sample our measurements using our macroscopic equipment, at say a 100 Hz frequency, we are looking at microscopic non-Gaussian fluctuations that have added up to give a Gaussian random variable.\nArgument using the Taylor Series I like to think of the CLT as a stepping stone to the law of large numbers (LoLN). Given i.i.d. random variables $x_i$ with mean $\\mu$ and some finite variance, let\n\\[ \\hat \\mu_n = \\frac{1}{n}\\sum_{i=1}^n x_i\\] Then the LoLN says that $\\lim_{n\\rightarrow \\infty} (\\hat \\mu_n- \\mu)$ equals $0$ (almost surely). We call $\\hat \\mu_n$ the sample mean. The CLT says something about what happens to $\\hat \\mu_n$ just before the LoLN kicks in. Notice that $\\hat \\mu_n$ is itself a random variable. The CLT says that $\\sqrt{n}(\\hat \\mu_n - \\mu)$ becomes more and more Gaussian distributed as $n\\rightarrow \\infty$.\nWhen $n\\gg1$, we are multiplying a large, large number ($\\sqrt{n}$) by a small number ($\\hat \\mu_n - \\mu$). Think of this small number, $\\hat \\mu_n - \\mu$, as a random variable that has its own distribution. Its distribution starts off looking like whatever, then as you keep increasing $n$ it closes in on the $y$-axis, looking more and more Gaussian (CLT). Eventually (as $n\\rightarrow\\infty$) it hugs the $y$-axis because now the only possible value for $\\hat \\mu_n - \\mu$ is $0$ (LoLN).\nWhat\u0026rsquo;s interesting is that the number $2$ shows up in the proof of the CLT for the above reasons. Look at Wikipedia\u0026rsquo;s proof of the CLT using characteristic functions. The proof uses the Taylor series approximation of the characteristic function of $\\sqrt{n}(\\hat \\mu_n - \\mu)$. The first term is a constant that corresponds to the LoLN (since $e^0=1$), the second term is a \u0026lsquo;square\u0026rsquo; term which corresponds to the CLT. 
The higher order terms drop out faster than the leading order (square) term. Finally, the square term also drops out, leaving us with just the constant term. Once again, we see that the CLT kicks in just before the LoLN does, but in addition, we can also see why the asymptotic distribution has a \u0026lsquo;square\u0026rsquo; in the numerator \u0026ndash; it comes from the leading order term of the Taylor series.\nArgument using Symmetry Let\u0026rsquo;s assume that it was not the Gaussian, but in fact some probability density function (pdf) that looks like $f(\\zeta) = C{}e^{-\\tfrac{1}{2}\\lVert\\zeta\\rVert_p^p}$ which was the limiting distribution in the CLT. Of course, we already know that the Gaussian has the parameter $p=2$, we are trying to figure out what might be so special about $p=2$. The reason for the minus sign being in the exponent is for $f(\\zeta)$ to be integrable, so that $\\int_{\\mathbb R^n}f(\\zeta)d\\zeta=1$; it\u0026rsquo;s still a pdf at the end of the day.\nSince $g(x)=e^{x}$ is a strictly increasing function, given any $0\\leq\\delta\\leq 1$, we can choose a corresponding $\\tau\\geq0$, such that\n\\[ e^{-\\|\\zeta\\|_p^p} \\geq \\delta\\quad \\Leftrightarrow \\quad \\|\\zeta\\|_p^p \\leq \\tau \\]\nThe inequality on the left gives the level sets or the \u0026lsquo;horizontal slices\u0026rsquo; of the pdf. The inequality on the right is called the norm ball corresponding to the $p$-norm. Thus, the shape of the norm ball characterizes the shape of the pdf $f(\\zeta)$. Since in the CLT Gaussians play a \u0026ldquo;universal\u0026rdquo; role, one can argue that its level sets should be spherical \u0026ndash; perfectly symmetric. 
There is no reason that the distribution should favor one direction over the other, because independent random variables cannot conspire with each other to add up in a particular direction (especially when, as in the statement of the CLT, their distribution is arbitrary!).\nAnd what do you know, $\\lVert \\zeta\\rVert _p^p \\leq \\tau$ is spherical when and only when $p=2$! The $2$-norm has some more properties which are unique to it among all the other $p$-norms. As an aside, the spherical shape of Gaussians is also where the $\\pi$ in the normalization constant comes from . This is a purely aesthetic argument which may or may not be in the spirit of mathematics depending on where you\u0026rsquo;re coming from. Where I\u0026rsquo;m coming from, this was my favorite argument of the three!\n","permalink":"https://shirazkn.github.io/posts/leastsquares/","summary":"The title is a reference to The Unreasonable Effectiveness of Mathematics in the Natural Sciences. I had a similar question about the number 2 which repeatedly shows up in engineering and science, specifically in the form of the 2-norm of a vector, and seems surprisingly effective at doing what it\u0026rsquo;s supposed to do.","title":"The Unreasonable Effectiveness of '2' in Statistics"},{"content":"One of my motivations for starting a blog was Eugenia Cheng\u0026rsquo;s book The Joy of Abstraction 1. It\u0026rsquo;s a surprisingly accessible, gentle introduction to category theory, a topic that is usually only taught to graduate students in math. She compiled part of the book using notes from a class that she teaches at the Art Institute of Chicago, which is a testament to the aesthetic appreciation that one can expect to gain of category theory irrespective of their academic background! 
In this post, I will introduce the main ideas in category theory (as I best understand it) and show that it offers an elegant way of thinking about mathematics.\nThe Main Idea The mathematician Tai-Danae Bradley describes category theory as a sort of \u0026lsquo;mad libs\u0026rsquo; for mathematics. Mad libs is a game where you have a sentence with a few blank spaces, and depending on what you place in the blank space, you get a different interpretation out of it:\nYour [noun] is so [adjective] that it's making me [adjective]! Mad libs fix the structure of the sentence, but the objects you put in let you extract different interpretations out of it. Try putting in the words blog, boring and sleepy 😕\nCategory theory lets you do something similar. Every now and then you encounter something in mathematics that makes you go \u0026ldquo;Hey, this feels a lot like that other thing!\u0026rdquo; If there were some way to move back and forth between the two universes that have a similar structure, we can apply the insights that we gain in one universe to the other. The following is an example of where I might want to \u0026lsquo;move between mathematical universes with the same structure\u0026rsquo;:\nThe so called contrapositive of the statement $\\textbf A\\Rightarrow \\textbf B$ ($\\textbf A$ implies $\\textbf B$), is $\\neg \\textbf B \\Rightarrow \\neg \\textbf A$ ('not $\\textbf B$' implies 'not $\\textbf A$'). 
A statement and its contrapositive are logically equivalent, i.e., \\[\\textbf A\\Rightarrow \\textbf B \\quad \\textit{if and only if} \\quad \\neg \\textbf B \\Rightarrow \\neg \\textbf A\\] As an example, consider the statement $\\textbf A$ = \u0026ldquo;Today is Sunday\u0026rdquo; and $\\textbf B$ = \u0026ldquo;Tomorrow is a Monday.\u0026rdquo; The statement $\\textbf A\\Rightarrow \\textbf B$ is true, but so is its contrapositive $\\neg \\textbf B \\Rightarrow \\neg \\textbf A$, i.e., \u0026ldquo;If tomorrow is not a Monday, then today is not Sunday.\u0026rdquo;\nNow consider the following relationship between sets $A$ and $B$ (not to be confused with the propositions $\\textbf A$ and $\\textbf B$):\nLet $A^\\complement$ denote the complement of $A$, i.e., everything that's not in $A$. Then we have \\[ A\\subseteq B \\quad \\textit{if and only if} \\quad B^\\complement \\subseteq A^\\complement \\] Everything that's inside $A$ is also inside $B$.\nEverything that's outside $B$ is also outside $A$. Even if these examples are of no interest to you, all we need to note is that we replaced the symbol $\\Rightarrow$ (implies) with $\\subseteq$ (is a subset of), but the relationship between the corresponding objects looks nearly identical2. As the objects we\u0026rsquo;re swapping out become more complex, the insights that carry over during the \u0026lsquo;swapping out\u0026rsquo; become deeper.\nThe two different mathematical universes (logic and set theory) consist of objects (statements and sets) as well as relationships between them ($\\Rightarrow$ and $\\subseteq$). Similarly, a category is something that is made up of objects as well as their relationships to each other (called morphisms), satisfying some additional rules which impart to it its structure.\nCategories The best part of category theory is that it takes a whopping $10$ minutes or so to set up the foundations of it, but it comes with rich insights and new ways of thinking about math. 
Let\u0026rsquo;s give the definition of a category.\nA category $\\mathcal C$ consists of objects and morphisms; the latter are basically arrows from one object in $\\mathcal C$ to another. We write $\\text{ob}(\\mathcal C)$ to refer to the collection of all the objects in $\\mathcal C$. Given two objects $X$ and $Y$ in $\\text{ob}(\\mathcal C)$, we write $\\mathcal C(X,Y)$ to denote the collection of all the morphisms (i.e., arrows/relationships) from $X$ to $Y$. Note that (unlike in the above two examples) there may be more than one arrow in $\\mathcal C(X,Y)$.\nIf there is an arrow $f$ from $X$ to $Y$, and an arrow $g$ from $Y$ to $Z$, then we can draw an arrow from $X$ to $Z$ (which represents the path of going from $X$ to $Y$, then to $Z$). We call this specific arrow from $X$ to $Z$ as the composition of $f$ and $g$, denoting it as $g\\circ f$. We also say that $f$ and $g$ are composable, since the arrow-head of $f$ touches the tail of $g$.\nTo see why the composition is written 'backwards', i.e., as $g \\circ f$ rather than '$f \\circ g$', think of the objects as sets and the morphisms as functions between sets. That is, if $f:A\\rightarrow B$ and $g:B\\rightarrow C$, then $g(f(\\cdot)):A\\rightarrow C$ is a function from $A$ to $C$. Composition of functions is a special case of the composition in category theory. In addition to objects, morphisms, and their compositions, any category has an underlying structure that is imparted to it by the following properties:\nIdentities: For every object $X$, there exists a morphism $1_X$ from $X$ to itself, called the identity morphism. Moreover, for any $X$ and $Y$ in $\\text{ob} (\\mathcal C)$ and $f$ in $\\mathcal C(X,Y)$, we must have $f \\circ 1_X = f = 1_Y \\circ f $. In other words, $1_X$ is like $\\text{stay at X}$, and $1_Y$ is like $\\text{stay at Y}$. 
Associativity: For $f$, $g$ and $h$ in $\\mathcal C(W,X)$, $\\mathcal C(X,Y)$, and $\\mathcal C(Y,Z)$, respectively, we must have $h \\circ (g \\circ f) = (h \\circ g) \\circ f$. Visually, the pink arrow in this figure can be obtained in two equivalent ways: For example, if $X$ and $Y$ are vector spaces and $M$ is a linear transformation that takes $X$ to $Y$, then the first bullet point is saying that $M I_X = M = I_Y M$, where $I_X$ corresponds to the identity matrix of the vector space $X$ (note that $I_X$ and $I_Y$ may have different dimensions as matrices!). Similarly, the second bullet point is just stipulating the associativity of matrix multiplication, which we are so used to that we take it for granted. The way in which we group (i.e., parenthesize) a chain of composable arrows does not matter.\nIn the two examples we gave earlier, logic and sets, we introduced the morphisms $\\Rightarrow$ and $\\subseteq$, respectively. Since $\\textbf A \\Rightarrow \\textbf A$ and $A\\subseteq A$, we do have identity morphisms (arrows from the object to itself) at each object. Similarly, if the objects are numbers, then $\\leq$ is another morphism that serves both as an identity as well as a relationship between two distinct numbers. At the same time, $\u003c$ would not work as an identity morphism, as it is not true that a number is $\u003c$ itself.\nObserve that we are in the business of \u0026lsquo;swapping out\u0026rsquo; morphisms and not just objects.\nThe Category of Sets We could define a category $\\mathcal C$ consisting of sets (as its objects) in many different ways, depending on what sort of relationships (morphisms) we want to represent between them. However, the category of sets refers to one particular type of category. It is the category $\\mathcal C$ where $\\text{ob}(\\mathcal C)$ are sets and $\\mathcal C(X, Y)$ is the collection of all the functions from sets $X$ to $Y$. 
A function $f$ in $\\mathcal C(X, Y)$ assigns to each element in $X$ an element in $Y$.\nNow I\u0026rsquo;ll demonstrate why this representation of sets is cleaner than the way we usually think about sets. Recall that a function $f:X\\rightarrow Y$ is injective or \u0026lsquo;one-one\u0026rsquo; if for all $x_1,x_2 \\in X, f(x_1)=f(x_2) \\Rightarrow x_1 = x_2$. In other words, two different elements cannot map to the same element. The function $f$ is called surjective or \u0026lsquo;onto\u0026rsquo; if for every $y\\in Y$, there exists some $x\\in X$ such that $f(x)=y$. In other words, the image of $X$ under the operation $f$ gives all of $Y$, and not just some part of $Y$. (The arrows in the following illustration are NOT supposed to be morphisms! Let\u0026rsquo;s emphasize that these are illustrations rather than diagrams.)\nNow I don\u0026rsquo;t know about you, but these definitions look pretty asymmetrical to me. If I\u0026rsquo;d never seen them before, I would guess that they are talking about two completely different concepts. The two concepts are united by the fact that if $f$ is both injective and surjective, then it is invertible, i.e., there exists some function $g: Y\\rightarrow X$ such that $g \\circ f = 1_X$ and $f\\circ g = 1_Y$. In this case, $f$ is said to be bijective, and $g$ is called the inverse of $f$.\nAs an example, let $X$ be a set of people and $Y$ be some tasks or missions. $f$ is the assignment of people in $X$ to tasks in $Y$. The inverse function $g$ can then be thought of like \u0026lsquo;holding people accountable for their tasks\u0026rsquo;. If a task or a mission goes badly, we want the inverse function $g$ to tell us exactly who to hold responsible for its failure. If $f$ is not injective, either of two people who worked on the same task may be to blame. If $f$ is not surjective, it means that we don\u0026rsquo;t have anyone to blame for one of the tasks, because apparently no one was even assigned that task. 
These are the situations in which the inverse function $g$ fails to exist.\nWe just saw that invertible functions between sets have a purely categorical definition, namely the existence of a \u0026lsquo;reverse\u0026rsquo; morphism from $Y$ to $X$ satisfying some properties. We could extend this definition to define invertible morphisms in any category. At the same time, we could ask if there is a purely categorical definition of injective and surjective functions.\nMonics and Epics In any category $\\mathcal C$ with objects $X$ and $Y$, a morphism $f$ in $\\mathcal C(X, Y)$ is said to be a monomorphism or a monic between $X$ and $Y$ if for all objects $Z$ and morphisms $g_1$ and $g_2$ in $\\mathcal C(Z, X)$,\n\\[ f \\circ g_1 = f\\circ g_2 \\quad \\Rightarrow \\quad g_1 = g_2 \\]\nIn any category $\\mathcal C$ with objects $X$ and $Y$, a morphism $f$ in $\\mathcal C(X, Y)$ is said to be an epimorphism or an epic 😎 between $X$ and $Y$ if for all objects $Z$ and morphisms $g_1$ and $g_2$ in $\\mathcal C(Y, Z)$,\n\\[ g_1 \\circ f = g_2 \\circ f \\quad\\Rightarrow \\quad g_1 = g_2 \\]\nIn the category of sets, the monomorphisms and epimorphisms are precisely the injective and surjective functions, respectively. Now these definitions look so symmetric that they almost seem wrong. So why is it that they are equivalent definitions for injectivity and surjectivity, respectively? To begin with, why is a monomorphism, as defined above, necessarily an injective function (and vice versa)?3\n1. Monics are Injective Functions The trick is to think of a monomorphism $f:X\\rightarrow Y$ as a re-labeling of the elements of $X$. Suppose $X$ is a set of people and $Y$ is a set of (distinct) names, then let $f$ be the operation of assigning each person a name. $f$ is injective if each person gets a distinct name (no two people have the same name). In that case, if I give you a list of names, you know exactly which list of people I picked. 
In the figure, the purple arrows are the monomorphism (assignment of distinct names):\n$f$ would not be a monomorphism if two people were given the same name, say, John, i.e., two of the purple lines go into the same name. In that case, I could have picked either John (corresponding to two different lists of people, $g_1$ and $g_2$) but produced the same list of names ($f \\circ g_1 = f \\circ g_2$), and you would have no way of telling which list of people you were looking at since you do not know which John I picked. You would instead point a finger at me and accuse me of being \u0026lsquo;ambiguous\u0026rsquo;.\nSuccinctly, $f$ remembers which elements went to which elements. It does not forget that there are $n$ distinct people by assigning two or more of them the same name.\n2. Epics are Surjective Functions Finally, let\u0026rsquo;s see why the definition of epimorphisms of sets works as the definition for surjectivity. This is perhaps more difficult to see, because it looks so different from the set-theoretic definition of surjectivity. Let\u0026rsquo;s take this post to a full circle and invoke the contrapositive of the definition for an epic, which says that an epimorphism or an epic is, equivalently, a morphism $f$ in $\\mathcal C(X, Y)$ such that for all objects $Z$ and morphisms $g_1$ and $g_2$ in $\\mathcal C(Y, Z)$,\n\\[ g_1 \\neq g_2 \\quad\\Rightarrow \\quad g_1 \\circ f \\neq g_2 \\circ f \\]\nThis means that pre-composing $g_1$ and $g_2$ with $f$ retains the ability to distinguish $g_1$ and $g_2$. 
Suppose $f$ weren\u0026rsquo;t surjective, then we may no longer be able to distinguish $g_1$ and $g_2$ because $f$ is not even \u0026rsquo;looking at\u0026rsquo; one of the elements, which may be crucial to distinguishing $g_1$ and $g_2$:\nSimilarly, the definition of monics (which are the injective functions) was really saying that post-composing with $f$ retains the ability to distinguish two morphisms $g_1$ and $g_2$, because in that case $f$ was just a re-naming of the objects.\nThinking Categorically Two days after I first published this post, two of my favorite math communicators made a podcast episode about category theory. Eugenia Cheng talks to Steven Strogatz about why she finds joy in thinking categorically, i.e., approaching these types of math concepts from a category theory standpoint.\nWhat we showed above was that some concepts in math might be a bit messy or aesthetically lacking due to how they\u0026rsquo;re set up. This messiness also prevents us from generalizing ideas like injectivity and surjectivity to other fields in math, because they\u0026rsquo;re too deeply rooted in the language of set theory. Category theory allows us to squint at mathematical objects in just the right way, until we\u0026rsquo;ve \u0026lsquo;blurred out\u0026rsquo; the messiness, focusing our attention only on the thing that really matters: structure! As a bonus, we get to apply our insights to all of the other mathematical fields that have analogous properties, and therefore similar (or identical) structures. I would argue that this is a joyful way of learning mathematics. One where all of mathematics is yielding itself to you, simultaneously.\nAt the time of me writing this post, there is an ongoing book club for The Joy of Abstraction being hosted by its author.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nEither statement is a consequence of duality in category theory. 
Duality manifests again on my blog when I talk about differential forms .\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nWe just showed one direction of each of these equivalences, at best. The main goal is to get an intuition for why the two definitions of injectivity and surjectivity may be equivalent!\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/cat_theory_1/","summary":"One of my motivations for starting a blog was Eugenia Cheng\u0026rsquo;s book The Joy of Abstraction. It\u0026rsquo;s a surprisingly accessible, gentle introduction to category theory, a topic that is usually only taught to graduate students in math. In this post, I will introduce the main ideas in category theory and show that it offers an elegant way of thinking about mathematics.","title":"Category Theory"},{"content":"In my earlier post I suggested that there is no objective notion of logical truth, that whether a statement is \u0026rsquo;true\u0026rsquo; can depend on the system of truth that one is operating in. Here we will develop that argument further using the concept of axiomatic systems. This is a long one, but I\u0026rsquo;m excited to talk about it!\nAxiomatic Systems An axiomatization is an assignment of rules (axioms) such as \u0026ldquo;one plus one equals two,\u0026rdquo; making up an axiomatic system or a formal system. If you happen to be within that formal system, then you must follow all of the assigned rules. The choice of axioms dictates the degree of expressiveness that one has within the formal system. For instance, North Korea has a small number of state-approved haircuts that everyone must choose from, which does not sound like a very expressive system.\nSimilarly, in pre-elementary school mathematics you are introduced to the natural numbers: $0$, $1$, $2$, $3\\dots$ You can add one natural number to another, but don\u0026rsquo;t you dare subtract them! Once we allow subtraction, we are able to express negative numbers such as -$5$. 
Then comes multiplication, which doesn\u0026rsquo;t seem to introduce any new numbers. By middle school, we are allowed the use of division, which grants us a greater degree of expression. We are able to concoct fantastic things such as $2.25$.\nChess Not all formal systems are stratified in this way. Consider the system of moves on a chessboard. It allows us to succinctly describe and manipulate objects in the world of chess, but it isn\u0026rsquo;t clear whether the system of chess goes \u0026lsquo;above\u0026rsquo; or \u0026lsquo;below\u0026rsquo; some other formal system about numbers.\nFormal systems which can (to varying extents) characterize the natural numbers include the Presburger arithmetic, Peano arithmetic, and the Zermelo theory. These are all formal systems named after their respective conceivers. There is no single canonical formal system for describing numbers. In fact, different systems capture different properties about the numbers, and we may be interested in exploring these properties and the ensuing mathematical structure. In the $1920$s, Zermelo\u0026rsquo;s system was extended (similar to how we extended the integers to the rationals, above) to the Zermelo-Fraenkel (ZF) set theory , often touted as being the foundation of modern mathematics , as it captures just about every mathematical concept that we come across in school. To see just how pedantic mathematicians are (I mean that in the most endearing sense), a key addition to the Zermelo-Fraenkel theory over Zermelo\u0026rsquo;s was that we could now count not just to infinity, but beyond . Of course, 21st century mathematicians are interested in adding more axioms to this system so that we can find new infinities in between the existing infinities 1. It seems as if somebody should stop them from taking this bit any further.\nI can\u0026rsquo;t speak for whether we invented numbers, but we definitely made up the rules for how they behave. 
I guess you could argue that we did make up numbers, but only so far as to effectively capture the essence and structure of our perceived reality. The natural numbers described by each of the above formal systems work precisely like counting does: $1$ plus $2$ makes $3$. Maybe there is such a thing as a \u0026rsquo;number\u0026rsquo; that just exists out there, but the numbers we use in our everyday math are the version that was invented by humans. It\u0026rsquo;s Zermelo and Fraenkel\u0026rsquo;s best approximation of what being a number entails. Like the cavemen in Plato\u0026rsquo;s allegory, we can trace over shadows on a wall, even if we can\u0026rsquo;t see or touch the objects casting these shadows.\nMathematics is certainly not just the study of numbers, though. It can be regarded as the study of patterns and structure instead. One begins this process by carefully choosing the axioms so as to best represent the structure that is to be studied, then further structure emerges through the logical inferences that are made about these axioms. Different mathematical fields can emerge from different sets of axioms. For example, the axioms we use to define ordinary numbers tell us how to count sheep ($6$ sheep $+$ $8$ sheep $=$ $14$ sheep), but we could also lay down axioms that make $6+8$ equal $2$, so that we can describe how to count hours on a clock! For doing probability we could use the Kolmogorov axioms, for geometry there\u0026rsquo;s Euclid\u0026rsquo;s axioms, for group theory we have group axioms, and so on.\nThere is a school of thought within mathematics called logicism, which believes that all of mathematics was founded on logic \u0026mdash; logical inference was the glue that held axioms together to form theorems. More specifically, they believed that any mathematical truth could be proved using the one-two punch of axioms and logical inference. The mathematician David Hilbert was at the helm of this movement in the $1920$s. 
Hilbert is perhaps best known for posing a series of unsolved problems in mathematics, most of which were conjectures that mathematicians felt were true, but had neither been able to prove nor disprove yet. One of these conjectures was that all of mathematics can be founded on logic.\nBut wait a second, the whole point of this exercise was to figure out how logic could itself be defined. After all, logical inference should itself follow rules, else the formal systems based on them would be kind of messy and all over the place. Well, we could define logic as being a formal system unto itself! The Zermelo-Fraenkel theory (which, recall, is the formal system underlying most of modern mathematics) is built on first-order logic , which is indeed a formal system. First-order logic comes with, in addition to its axioms, its own formal system for logical inference, called zeroth-order logic or propositional logic .\nBefore we continue, it is worth mentioning that some languages (but not natural languages like English) can be defined as axiomatic systems too, and there are good reasons for why one might want to do this. Formal languages have been used to study linguistic structures. Alfred Tarski (of the Banach-Tarski paradox fame) also attempted to define truth using formal languages, noting that using a formal language one could construct a consistent, fool-proof definition for truth that was free of ambiguity.\nThen there are programming languages. Alan Turing introduced the concept of Turing machines (which are axiomatic systems) to study the fundamental limitations of computers, laying the foundation for all of theoretical computer science. Amazingly, he did so in a time when the word computer referred to a human occupation, not an electronic machine. We\u0026rsquo;ll revisit both Tarski and Turing\u0026rsquo;s works later in this post.\nChoosing the Axioms Let\u0026rsquo;s ask the question of how one chooses an axiomatic system. 
What kind of axioms would we want for our system? Suppose $A$ is an axiom and $\\neg A$ is its negation, i.e., the statement \u0026lsquo;$A$ is not true\u0026rsquo;. Clearly, $A$ and $\\neg A$ cannot both be axioms as this gives us a contradictory system. So we have our first requirement:\nCONSISTENCY \u0026#8211; It should not be possible to prove contradictory statements\nWe should also have the ability to prove things within an axiomatic system. Proofs (or disproofs) are how we can be certain that something is true (or false). Those unsolved problems that Hilbert posed? Surely, any good axiomatic system will have either a proof or a disproof of each of the conjectures. Anything that is proved, we can call a theorem. If a conjecture gets disproved instead, we can negate it and call that a theorem. That is ideally how we would like to do mathematics.\nCOMPLETENESS \u0026#8211; A given (logically meaningful) statement should admit either a proof or a disproof\nFinally, a characteristic of a useful axiomatic system is that we are able to do something with these axioms. A mathematics based on just $12$ numbers may be sufficient to tell the time, but there are many situations where we would like to count past $12$. This leads us to the very subjective stipulation which we call expressiveness.\nEXPRESSIVENESS \u0026#8211; The axiomatic system should have a variety of objects and relationships that we can, for example, use to describe real-world phenomena\nWe now have all the background needed to get into the meat of this post.\nGödel\u0026rsquo;s Incompleteness Theorems The mathematician Kurt Gödel was able to show, around 100 years ago, that no axiomatic system can have all three of these qualities. Specifically, he showed that any axiomatic system (that is expressive enough to do basic arithmetic) is either incomplete or inconsistent. 
This is called Gödel\u0026rsquo;s first incompleteness theorem.\nImplications in Mathematics Let\u0026rsquo;s think about what this means. Suppose we design an axiomatic system that is sufficiently expressive (can do basic computations, like add one thing to another to give two things). Either there are some true statements in it which cannot be proved using the axioms (incompleteness), or there exist contradictory statements in it, each of which can be proved using the axioms (inconsistency). In fact, the latter case, inconsistency, is far worse, and pretty much a deal-breaker. A single contradiction can suspend all credibility of an axiomatic system, because in a contradictory axiomatic system one can \u0026ldquo;prove\u0026rdquo; just about anything . That\u0026rsquo;s bad! We don\u0026rsquo;t want to have one mathematician prove that $2$ plus $2$ equals $5$, another prove that it doesn\u0026rsquo;t, and have no grounds for deciding who\u0026rsquo;s right. So it\u0026rsquo;s looking like we\u0026rsquo;d rather mathematics be incomplete (i.e., missing some things) than inconsistent (i.e., self-contradictory).\nIt turned out that at least one of Hilbert\u0026rsquo;s unsolved problems could neither be proved nor disproved, namely the continuum hypothesis (which conjectured that there is no set that is bigger than the integers and smaller than the real numbers). There was absolutely nothing that the Zermelo-Fraenkel ($ZF$) system could say definitively about this conjecture. So then what do we do with the continuum hypothesis ($CH$)? Does it not have a place in mathematics? Well, we could assume $CH$ is true and adopt it as a new axiom. We could also assume that it is false (which we write as \u0026lsquo;$\\neg CH$\u0026rsquo;) and adopt that as an axiom instead. Adding an axiom like this comes with its consequences, though. For instance, adding $C$ (the axiom of choice ) to $ZF$ leads to the implication of the Banach-Tarski paradox . 
Put simply, this means that geometry can have non-intuitive (\u0026lsquo;weird\u0026rsquo;) properties in $ZFC$. So we want to double-check whether we indeed want geometry to behave in this way, before adding $CH$ (or $\\neg CH$) as an axiom to $ZF$. Mathematicians were able to show that adding $CH$ at least does not immediately make $ZF$ or $ZFC$ inconsistent, but that\u0026rsquo;s another possibility we should consider before adding new axioms. Given a choice, we\u0026rsquo;d rather be incomplete than inconsistent.\nReasons for Incompleteness There are axiomatic systems which are indeed consistent and complete, and first-order logic is one of them! This is the content of the lesser known Gödel\u0026rsquo;s completeness theorem (although there are differences in what \u0026ldquo; completeness \u0026rdquo; refers to in these theorems). The incompleteness theorems require us to first qualify how expressive a system is, before we can say anything about its incompleteness. So let\u0026rsquo;s now look at when and why a given axiomatic system, be it language, logic, or mathematics, comes under the purview of the incompleteness theorems.\nInfinity / Recursion: Observe that, if an axiomatic system has only finitely many statements in it, we can just enumerate through all of its statements and check whether they\u0026rsquo;re true. Sort of like a brute force proof. A brute force proof in a system with infinitely many \u0026rsquo;things\u0026rsquo; in it either may or may not terminate (see the four color theorem and the Riemann hypothesis , respectively, for examples of either case.)\nAlan Turing\u0026rsquo;s work involved the study of programming languages, or rather, the computer algorithms that can be written using them. He showed that no computer program can decide (i.e., determine with certainty) whether another computer program halts (i.e., terminates, as opposed to getting stuck in an infinite loop). 
This is called the halting problem, and it is said to be undecidable. Here, undecidable only means that there is no effective algorithm to solve the halting problem. Similarly, Gödel\u0026rsquo;s theorem says nothing about the (non-)existence of infinitely long proofs. It just says that there are statements that do not have finite proofs, but the word finite is typically omitted while stating the incompleteness theorems. (At this point, I would implore the reader who is familiar with Cauchy\u0026rsquo;s completeness in metric spaces to compare the two uses of the term \u0026lsquo;completeness\u0026rsquo; 😃)\nIt seems like it is necessary and sufficient for an axiomatic system to have some notion of \u0026lsquo;infinitely many things\u0026rsquo; in it for Gödel\u0026rsquo;s theorems to apply. But this isn\u0026rsquo;t quite the case either. Robinson arithmetic is an incomplete axiomatic system that can be generated using finitely many axioms. On the other hand, Tarski's theory of real closed fields is a complete and decidable axiomatic system that characterizes the infinitely large set of real numbers2.\nAs a segue, consider the following block of (Python) code, which (in principle) prints its input over and over without end:\ndef recursive_function(x): print(x) recursive_function(x) The function above takes some input x and uses what programmers call \u0026lsquo;self-reference\u0026rsquo; (recursion) to print the object x again and again on the screen; an idealized machine would do so infinitely many times, although a real Python interpreter stops once it hits its recursion limit. This indicates that self-reference has the potential to generate infinitely many things. This is a hint at the fact that self-reference, and not infinitude, may be the deeper reason for incompleteness.\nSelf-Reference: A universal feature of axiomatic systems where Gödel-like incompleteness theorems do apply is that these systems are capable of self-reference. Self-reference leads to what we call in common parlance paradoxes. 
A well-known example of a paradox arising from self-reference is \u0026ldquo;This statement is false\u0026rdquo;; we can\u0026rsquo;t assign a truth value to this statement without arriving at a contradiction, exactly as in Gödel\u0026rsquo;s theorem. Another well-known paradox from mathematics (which is of historical significance) is Russell\u0026rsquo;s paradox, which asks whether the set of all sets that do not contain themselves contains itself. Russell\u0026rsquo;s paradox showed that \u0026rsquo;naive set theory\u0026rsquo; (one in which you can posit the existence of such a set, or even of a \u0026lsquo;set of all sets\u0026rsquo;) is inconsistent.\nThe \u0026lsquo; Barber paradox \u0026rsquo; is a more prosaic way of asking the same question: There is a village that has only one barber, who shaves those (and only those) who do not shave themselves. Does the barber shave themself?3\nA lot of Gödel-like theorems use proof by contradiction , where the contradiction arises from self-reference. Turing\u0026rsquo;s proof of the undecidability of the halting problem involves giving a computer program a version of itself as the input. These proofs are usually surprisingly accessible; you could watch this video if you\u0026rsquo;re curious.\nThe Second Incompleteness Theorem Okay, so many of our axiomatic systems are incomplete, including mathematics. That might be a good thing, right? At least they aren\u0026rsquo;t inconsistent? Well, the second of Gödel\u0026rsquo;s incompleteness theorems states that a sufficiently expressive consistent system cannot prove its own consistency! As we already know that $ZF$ is incomplete (by the first incompleteness theorem), it is guaranteed that $ZF$ has some unprovable statements. One of these unprovable statements is that of its own consistency.\nRecall that (someone named) Tarski tried to define truth using axiomatic systems. In the last post, I had conflated the concepts of truth and consistency. 
There is indeed a version of the second incompleteness theorem that swaps consistency out for truth. Tarski\u0026rsquo;s undefinability theorem says that the concept of truth in a formal system cannot be defined within that system. Tarski\u0026rsquo;s proof of the undefinability theorem mostly relies on self-reference, rather than on recursion as many other Gödel-like proofs do. It is also about formal systems in general, and not about mathematics. Owing to this fundamentality of Tarski\u0026rsquo;s theorem, the mathematician Raymond Smullyan showed in $1957$ that Gödel\u0026rsquo;s incompleteness theorems can be applied to many more formal systems than was believed in Gödel\u0026rsquo;s time. For these reasons, Smullyan also espouses that Tarski\u0026rsquo;s work should get much of the attention that Gödel\u0026rsquo;s does.\nIt seems like the study of consistency (or provability, truth, etc.) should itself be privy to the problem of self-reference, since we say that a formal system is unable to prove its own consistency. For that matter, the second incompleteness theorem is phrased a bit peculiarly; it doesn\u0026rsquo;t preclude the possibility that one formal system can prove the consistency of another.\nTranscending Language using Metalanguage As opposed to proofs in a formal system, proofs about a formal system can be expressed in a metalogic or a metalanguage. The metalanguage sits \u0026lsquo;above\u0026rsquo; the formal system it is looking at, called the object language, in the sense that it is often more \u0026lsquo;powerful\u0026rsquo; or expressive than the object language, and assumes the authority to state and prove theorems about the object language. Proofs about (in)consistency, (in)completeness, (un)decidability, and (un)provability are often stated in a metalogic/metalanguage. 
Similarly, Tarski\u0026rsquo;s definition of truth in a language (which is in this case the object language) was stated in a metalanguage, and his undefinability theorem showed that the metalanguage and its object language do not necessarily coincide. As in the case of the incompleteness theorems, all of this holds only if the object language is sufficiently expressive to begin with. If the language is such that self-reference and/or basic arithmetic are not within its expressive capabilities, then it is indeed possible that it could serve as its own metalanguage, and that it is both complete and consistent.\nNatural Languages But natural languages are expressive by design; that\u0026rsquo;s their whole point. They encompass everything that we would want to talk to each other about. This includes a multitude of axiomatic systems, spanning formal language, logic, and mathematics. Since linguists, logicians, and mathematicians talk to each other in natural languages like English, natural languages should necessarily operate as metalanguages. If there were such a thing as a \u0026lsquo;formal natural language\u0026rsquo;, it would seem that Gödel\u0026rsquo;s theorems must apply to it, by virtue of its expressiveness. But in contrast to axiomatic systems of mathematics, we\u0026rsquo;d probably want a formal natural language to be complete rather than consistent.\nCounter-intuitively, formal systems that are 'bigger' are also more likely to be incomplete; recall that first-order logic is complete, but $ZF$ is not! One way to resolve this is to consider that a small box is easy to fill up with things, but bigger boxes take a lot of work to fill up (a box with twice the side length has $8$ times the volume!). The property of being bigger increases the box's propensity for incompleteness. If you went to an English-speaking school, you were probably taught the semantics and grammatical constructs of the English language in English. 
Mathematicians are able to state Gödel\u0026rsquo;s theorems in English. Computer programmers use English to write pseudo-code. A natural language such as English has some element of completeness in the sense that we want it to do everything. If it weren\u0026rsquo;t complete, we\u0026rsquo;d just want a metalanguage \u0026lsquo;above\u0026rsquo; it to fill in the holes, and so on. For example, a non-native English speaker may instinctively switch to their native tongue when they\u0026rsquo;re trying to express something complex, something they may not have the vocabulary for in English. It is even possible that they went to a school where English (the object language) was taught to them in their native tongue (the metalanguage).\nThis is similar to how we added $CH$ to $ZF$ earlier to \u0026lsquo;complete\u0026rsquo; it in some sense, but we cautioned that it comes at the risk of making the more powerful language, $ZF+CH$, inconsistent. In the case of natural languages, we embrace this potential inconsistency as being the trade-off for completeness.\nBefore we close, I should do my due diligence and note that, while formal languages can be used to discuss natural languages, the latter fall under the category of informal languages and may not have a definable notion of truth. In fact, Tarski explicitly warned against the extension of his ideas to natural languages. Tarski's critics also stress the distinction between mathematical truth and metaphysical truth, which should not be conflated with each other. Nonetheless, one could argue that the mathematical (or rather, the axiomatic) approach to defining truth is as clean and air-tight a characterization of truth as we can hope to achieve. We just need to bear in mind that we aren\u0026rsquo;t here to establish a singular meaning for the word truth, but to appreciate the intricacies involved in defining truth and logic. 
The self-reference in trying to arrive at the true definition of truth shall not be lost on us.\nThe two infinities of the continuum hypothesis are the sizes of the natural numbers and the real numbers. See Cantor\u0026rsquo;s diagonalization for a proof of the fact that these sizes are indeed different, one of those profound-yet-accessible proofs in math. Incidentally, the mathematical tool used in proving Gödel and Tarski\u0026rsquo;s theorems is named the diagonal lemma for its resemblance to Cantor\u0026rsquo;s diagonalization.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThe fact that there exists a Gödel-complete axiomatization of the real numbers should come as a surprise. Surely, any axiomatization of the reals must be more expressive than the so-called \u0026lsquo;basic arithmetic\u0026rsquo; that Gödel\u0026rsquo;s theorems apply to, right? The reason why Tarski\u0026rsquo;s axiomatization of the reals is complete is because it cannot do the basic arithmetic that is stipulated in Gödel\u0026rsquo;s theorems. While Tarski\u0026rsquo;s axiomatization captures the properties of the reals, it does not have what it takes to define integers and their arithmetic properties. (Think of a number line that does not have any markings on it!) In fact, if one so much as introduces a $\\sin$ function into this axiomatization, it becomes undecidable , because the $\\sin$ function indirectly allows us to encode integer arithmetic within the system.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThe barber in the Barber paradox appears to be a special person whom the rule does not apply to. 
This is similar to how the \u0026lsquo;set\u0026rsquo; in Russell\u0026rsquo;s paradox cannot be treated as just any other set, but perhaps we could give it another name in order to avoid self-reference, and thus avoid contradiction.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/language_and_logic2/","summary":"I suggested that there is no objective notion of logical truth, that whether a statement is \u0026rsquo;true\u0026rsquo; can depend on the system of truth that one is operating in. Here we will develop that argument further using the concept of axiomatic systems.","title":"The Incompleteness Theorems"},{"content":"This is my first post! It discusses the question of whether spoken and written languages like English could be \u0026rsquo;logical\u0026rsquo; by design. I will break this post up into two parts. The first one does not require a mathematical background whatsoever, whereas the second touches on the concepts of axioms, theorems and proofs.\nFallacies The word logic as it is used in everyday parlance refers to informal logic (as opposed to formal logic , which is instead a rigorous mathematical construct). You\u0026rsquo;ve probably come across logical fallacies such as the false dilemma , which goes something like this:\nPerson 1: Women and trans people are in need of better mental health infrastructure.\nPerson 2: How can that be, when this statistic shows that it's mostly cis men who committed self harm in 2016?\nPerson 3: Oh yeah? What about this other statistic about trans people which clearly shows that... Here, Persons 2 and 3 are the culprits, for insisting that only one of two alternatives can be true. This is what we mean by a false dilemma. Of course, there isn\u0026rsquo;t actually a dichotomy here; men need better mental health support systems encouraging them to process their feelings; trans people and women would benefit from policies that make it easier and safer for them to seek help. 
The false dilemma fallacy makes for great bipartisan politics.\nPerhaps a more subtle example of a logical fallacy is one that misdirects not necessarily in its written form, but through accent or emphasis. This is the fallacy of accent , which refers to how the interpretation of a sentence can be modified by emphasizing one word over another:\nTrans people are in need of better mental health infrastructure.\nTrans people are in need of better mental health infrastructure. The second sentence may be excluding the possibility that cis people need better mental health infrastructure. It suggests that, if there is a box labeled \u0026ldquo;people who need better mental health infrastructure,\u0026rdquo; then it has room only for trans people. Whatever else you might have put into that box earlier is now gone.\nThe Wikipedia page for the false dilemma fallacy hints at why natural (spoken and written) languages are so amenable to misinterpretation:\nOur liability to commit false dilemmas may be due to the tendency to simplify reality by ordering it through either-or-statements, which is to some extent already built into our language.\nWhoever wrote this sentence suggests that fallacies are built into our language.\nAs a bit of a sidebar, last month, my mom demonstrated a mastery of the false dilemma, using it to win an argument that we were having over something silly. I couldn\u0026rsquo;t help but marvel at how natural her suggestion seemed to me at the moment, when conveyed in the language we were conversing in – Urdu. If only we had been speaking English instead, I would have seen right through her ploy. Maybe it\u0026rsquo;s because I use English and not Urdu to reason in my daily life. I have only ever spoken Urdu to my family, and I don\u0026rsquo;t reason with my family (mostly because it\u0026rsquo;s always a losing battle). Have I been conditioned into forgoing logical thinking while talking in Urdu? 
Or are fallacies not only built into our language (whatever that means), but also vary in their nature and potency depending on the language and context that we\u0026rsquo;re in?\nThe text After Babel by George Steiner offers a possible explanation for why fallacies might be built into languages. On page $231$, he gives his perspective on how languages are shaped over time:\nWe speak first to ourselves, then to those nearest us in kinship and locale. We turn only gradually to the outsider, and we do so with every safeguard of obliqueness, of reservation, of conventional flatness or outright misguidance.\nAs human communities manufactured their languages, they did so with the express intent of obscuring any meaning to outsiders, to maintain secrecy. If anything, the ability to lie and misdirect with language has been critical to its popularization . Just as the politicians today use the false dilemma to garner support, so have those in power used logical fallacies to their advantage for thousands of years. The honest working-class person seldom wrote seminal historical texts, their language has not propagated through time quite as well. Only those with money, time, and/or inflated egos have had the luxury of writing influential texts. The punchline comes from page $224$ of Steiner:\n\u0026hellip; the uses of language for \u0026lsquo;alternity\u0026rsquo;, for mis-construction, for illusion and play, are the greatest of man\u0026rsquo;s tools by far.\nIt is certainly one of my mom\u0026rsquo;s greatest tools.\nTruth is Subjective Humans are organized into systems, communities that talk to each other in a certain language. Between two systems, languages, with all of their words, syntaxes, enunciations, and cultural connotations, can differ either slightly (Californians v. New Yorkers) or drastically (Urdu speakers v. English speakers). 
Can the notion of being logical differ between the two systems (and thus, their languages)?\nWhen in the $1600$s Galileo suggested that the Earth went around the Sun, he was called a heretic for contradicting the Catholic Church\u0026rsquo;s Earth-centric model of the world. Within that system, it was false that the Earth went around the Sun. In a community where everyone believes in God, it is considered perfectly logical to ascribe the creation of the universe to God. There is no reason to prove it, because to a logician within this system, the theory that God created the universe is perfectly consistent with all the theories that came prior to it. Consider, as another example, that most physicists do not bother to prove that time flows in one direction . They take the flow of time for granted while solving complex equations, secretly hoping that it doesn\u0026rsquo;t lead them to any contradictions down the line. We don\u0026rsquo;t actually know for sure that time flows only in one direction, but it seems to align with all the observations we\u0026rsquo;ve made so far as a species.\nThese are things we lay down as foundations for our worldview, and we always need to start somewhere in order to draw further inference.\nIt turns out that we define truth quite similarly. Once you are within a system, the truth of that system is anything that is consistent within the system. Inconsistency, i.e., contradicting the aforementioned truths, is what is meant by being false. How was Galileo to propose his radical idea to the world, had he not had access to a language that allowed him to be logically inconsistent with what was considered true at the time?\nIn both mathematics and scientific deduction, contradictions are a major blow to the field, because they mean that we need to upend much of the theory that we\u0026rsquo;ve been relying on. 
It\u0026rsquo;s like finding out that the foundation of the building you\u0026rsquo;ve been constructing so meticulously was fractured all along. Sometimes it\u0026rsquo;s exciting all the same; it just means that we need to have an open mind and accept a new idea. Albert Einstein lived through at least three such \u0026lsquo;major blows\u0026rsquo; during his lifetime. The first was of his own doing: the realization that spacetime curves and contorts in ways that are impossible for humans to even visualize. The second time, he was the one who was unsettled by the Danish physicist Niels Bohr\u0026rsquo;s theory of quantum physics .\nThe third blow came in the $1930$s, this time aimed straight for the very foundation of mathematics as it was known at the time. It was revealed that not only is truth subjective, but that there exist statements that can be classified as neither true nor false1. Sometimes, things we suspect are true will elude any proof. Mathematicians did not like that; the whole point of mathematics was to prove things and to be sure that we won\u0026rsquo;t run into any inconsistencies. Mathematics was the one language that was supposed to be free of this sort of ambiguity.\nPrior to these revelations, mathematicians had been sleeping soundly in their mathematical beds, unaware that a man named Kurt Gödel was loosening the screws of their bedframe. In the next post , we will see just what Gödel (and others such as Alan Turing ) had to say to the world of mathematics and logic, what any of this has to do with language, and why these findings necessitate that natural language should be logically inconsistent if we are to do anything meaningful with it.\nThis is eerily reminiscent of superposition in quantum physics . No doubt it gave A. 
Einstein quite the shiver down his spine.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://shirazkn.github.io/posts/language_and_logic1/","summary":"This post discusses the question of whether spoken and written languages like English could be \u0026rsquo;logical\u0026rsquo; by design. The first part does not require a mathematical background, whereas the second touches on the concepts of axioms, theorems and proofs.","title":"Misuse as a Use of Language"}]