Linear Algebra & Matrix Calculus: Gradients, Hessians, and Optimization Review
Dive into the world of matrix calculus, where we explore gradients, Hessians, and their applications in optimization problems. Learn how these tools connect to eigenvalues, least squares, and determinant analysis, and understand their significance in fields like machine learning, physics, and engineering.
1. Basic Concepts and Notation
Linear algebra provides a compact way to represent and operate on sets of linear equations. For example, consider the following system of equations:
\[ 4x_1 - 5x_2 = -13 \] \[ -2x_1 + 3x_2 = 9. \]
This consists of two equations with two variables. As you know from high school algebra, we can find a unique solution for \( x_1 \) and \( x_2 \), unless the equations are degenerate (e.g., if one equation is a multiple of the other). In this case, there is indeed a unique solution. Using matrix notation, we can write the system more compactly as:
\[ A\mathbf{x} = \mathbf{b}, \]
where
\[ A = \begin{bmatrix} 4 & -5 \\ -2 & 3 \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} -13 \\ 9 \end{bmatrix}. \]
Analyzing linear equations in this form offers many advantages, including computational efficiency and compact representation.
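To make this concrete, here is a small NumPy sketch (the library, seed-free example values, and variable names are our own choices, purely for illustration) that solves the system above:

```python
import numpy as np

# The system Ax = b from above.
A = np.array([[4.0, -5.0],
              [-2.0, 3.0]])
b = np.array([-13.0, 9.0])

# np.linalg.solve factors A rather than forming A^{-1} explicitly.
x = np.linalg.solve(A, b)
print(x)                       # [3. 5.], i.e. x1 = 3, x2 = 5
print(np.allclose(A @ x, b))   # True
```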
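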
1.1 Basic Notation
We use the following notation:
- By \( A \in \mathbb{R}^{m \times n} \), we denote a matrix with \( m \) rows and \( n \) columns, where the entries of \( A \) are real numbers.
- By \( \mathbf{x} \in \mathbb{R}^n \), we denote a vector with \( n \) entries. By convention, an \( n \)-dimensional vector is represented as an \( n \times 1 \) column vector. To explicitly represent a row vector (a matrix with 1 row and \( n \) columns), we write \( \mathbf{x}^T \), where \( \mathbf{x}^T \) denotes the transpose of \( \mathbf{x} \).
- The \( i \)-th element of a vector \( \mathbf{x} \) is denoted \( x_i \): \[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}. \]
- We use \( a_{ij} \) (or \( A_{ij}, A_{i,j} \)) to denote the entry of \( A \) in the \( i \)-th row and \( j \)-th column: \[ A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}. \]
- The \( j \)-th column of \( A \) is denoted by \( \mathbf{a}_j \) or \( A_{:,j} \): \[ A = \begin{bmatrix} | & | & & | \\ \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_n \\ | & | & & | \end{bmatrix}. \]
- The \( i \)-th row of \( A \) is denoted by \( \mathbf{a}_i^T \) or \( A_{i,:} \): \[ A = \begin{bmatrix} \text{---} & \mathbf{a}_1^T & \text{---} \\ \text{---} & \mathbf{a}_2^T & \text{---} \\ & \vdots & \\ \text{---} & \mathbf{a}_m^T & \text{---} \end{bmatrix}. \]
- Note that these definitions may appear ambiguous (e.g., \( \mathbf{a}_1 \) and \( \mathbf{a}_1^T \) are not the same vector). However, the meaning of the notation should be clear from its context.
2. Matrix Multiplication
The product of two matrices \( A \in \mathbb{R}^{m \times n} \) and \( B \in \mathbb{R}^{n \times p} \) is the matrix \[ C = AB \in \mathbb{R}^{m \times p}, \] where \[ C_{ij} = \sum_{k=1}^n A_{ik}B_{kj}. \] Note that for the matrix product to exist, the number of columns in \( A \) must equal the number of rows in \( B \). There are many ways to interpret matrix multiplication, and we'll start by examining a few special cases.
2.1 Vector-Vector Products
Given two vectors \( \mathbf{x}, \mathbf{y} \in \mathbb{R}^n \), the quantity \( \mathbf{x}^T\mathbf{y} \), sometimes called the inner product or dot product of the vectors, is a real number given by \[ \mathbf{x}^T\mathbf{y} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \sum_{i=1}^n x_i y_i. \]
Observe that inner products are a special case of matrix multiplication. It is always true that \( \mathbf{x}^T\mathbf{y} = \mathbf{y}^T\mathbf{x} \).
Given vectors \( \mathbf{x} \in \mathbb{R}^m \) and \( \mathbf{y} \in \mathbb{R}^n \) (not necessarily of the same size), \( \mathbf{x}\mathbf{y}^T \in \mathbb{R}^{m \times n} \) is called the outer product of the vectors. It is a matrix whose entries are given by \( (\mathbf{x}\mathbf{y}^T)_{ij} = x_i y_j \), i.e., \[ \mathbf{x}\mathbf{y}^T = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} \begin{bmatrix} y_1 & y_2 & \cdots & y_n \end{bmatrix} = \begin{bmatrix} x_1y_1 & x_1y_2 & \cdots & x_1y_n \\ x_2y_1 & x_2y_2 & \cdots & x_2y_n \\ \vdots & \vdots & \ddots & \vdots \\ x_my_1 & x_my_2 & \cdots & x_my_n \end{bmatrix}. \]
As an example of how the outer product can be useful, let \( \mathbf{1} \in \mathbb{R}^n \) denote an \( n \)-dimensional vector whose entries are all equal to 1. Furthermore, consider the matrix \( A \in \mathbb{R}^{m \times n} \) whose columns are all equal to some vector \( \mathbf{x} \in \mathbb{R}^m \). Using outer products, we can represent \( A \) compactly as \[ A = \begin{bmatrix} | & | & & | \\ \mathbf{x} & \mathbf{x} & \cdots & \mathbf{x} \\ | & | & & | \end{bmatrix} = \begin{bmatrix} x_1 & x_1 & \cdots & x_1 \\ x_2 & x_2 & \cdots & x_2 \\ \vdots & \vdots & \ddots & \vdots \\ x_m & x_m & \cdots & x_m \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix} = \mathbf{x}\mathbf{1}^T. \]
2.2 Matrix-Vector Products
Given a matrix \( A \in \mathbb{R}^{m \times n} \) and a vector \( \mathbf{x} \in \mathbb{R}^n \), their product is a vector \( \mathbf{y} = A\mathbf{x} \in \mathbb{R}^m \). There are several ways to interpret matrix-vector multiplication, and we will examine each in turn.
If we write \( A \) by rows, then we can express \( A\mathbf{x} \) as \[ \mathbf{y} = A\mathbf{x} = \begin{bmatrix} \text{---} & \mathbf{a}_1^T & \text{---} \\ \text{---} & \mathbf{a}_2^T & \text{---} \\ & \vdots & \\ \text{---} & \mathbf{a}_m^T & \text{---} \end{bmatrix} \mathbf{x} = \begin{bmatrix} \mathbf{a}_1^T\mathbf{x} \\ \mathbf{a}_2^T\mathbf{x} \\ \vdots \\ \mathbf{a}_m^T\mathbf{x} \end{bmatrix}. \] In other words, the \( i \)-th entry of \( \mathbf{y} \) is equal to the inner product of the \( i \)-th row of \( A \) and \( \mathbf{x} \), \[ y_i = \mathbf{a}_i^T\mathbf{x}. \]
Alternatively, if we write \( A \) in column form, we see that \[ \mathbf{y} = A\mathbf{x} = \begin{bmatrix} | & | & & | \\ \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_n \\ | & | & & | \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \mathbf{a}_1x_1 + \mathbf{a}_2x_2 + \ldots + \mathbf{a}_nx_n. \] In other words, \( \mathbf{y} \) is a linear combination of the columns of \( A \), where the coefficients of the linear combination are given by the entries of \( \mathbf{x} \).
So far, we have been multiplying on the right by a column vector, but it is also possible to multiply on the left by a row vector. This is written as \( \mathbf{y}^T = \mathbf{x}^TA \) for \( A \in \mathbb{R}^{m \times n} \), \( \mathbf{x} \in \mathbb{R}^m \), and \( \mathbf{y} \in \mathbb{R}^n \). As before, we can express \( \mathbf{y}^T \) in two obvious ways, depending on whether we express \( A \) in terms of its rows or columns.
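As a quick illustration (a small NumPy sketch with arbitrarily chosen numbers), the row and column interpretations of \( A\mathbf{x} \) give the same result:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # A in R^{2 x 3}
x = np.array([1.0, 0.0, -1.0])

# Row view: the i-th entry of y is the inner product of the i-th row of A with x.
y_rows = np.array([A[i, :] @ x for i in range(A.shape[0])])

# Column view: y is a linear combination of the columns of A, weighted by the x_j.
y_cols = sum(x[j] * A[:, j] for j in range(A.shape[1]))

print(y_rows, y_cols, A @ x)           # all three agree: [-2. -2.]
```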
2.3 Matrix-Matrix Products
Armed with this knowledge, we can now examine four different (but equivalent) ways of viewing the matrix-matrix multiplication \( C = AB \) as defined at the beginning of this section.
First, we can view matrix-matrix multiplication as a set of vector-vector products. The most obvious viewpoint, which follows immediately from the definition, is that the \( (i, j) \)-th entry of \( C \) is equal to the inner product of the \( i \)-th row of \( A \) and the \( j \)-th column of \( B \). Symbolically, this looks like the following: \[ C = AB = \begin{bmatrix} \text{---} & \mathbf{a}_1^T & \text{---} \\ \text{---} & \mathbf{a}_2^T & \text{---} \\ & \vdots & \\ \text{---} & \mathbf{a}_m^T & \text{---} \end{bmatrix} \begin{bmatrix} | & | & & | \\ \mathbf{b}_1 & \mathbf{b}_2 & \cdots & \mathbf{b}_p \\ | & | & & | \end{bmatrix} = \begin{bmatrix} \mathbf{a}_1^T\mathbf{b}_1 & \mathbf{a}_1^T\mathbf{b}_2 & \cdots & \mathbf{a}_1^T\mathbf{b}_p \\ \mathbf{a}_2^T\mathbf{b}_1 & \mathbf{a}_2^T\mathbf{b}_2 & \cdots & \mathbf{a}_2^T\mathbf{b}_p \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{a}_m^T\mathbf{b}_1 & \mathbf{a}_m^T\mathbf{b}_2 & \cdots & \mathbf{a}_m^T\mathbf{b}_p \end{bmatrix}. \]
Remember that since \( A \in \mathbb{R}^{m \times n} \) and \( B \in \mathbb{R}^{n \times p} \), \( \mathbf{a}_i \in \mathbb{R}^n \) and \( \mathbf{b}_j \in \mathbb{R}^n \), so these inner products all make sense. This is the most "natural" representation when we represent \( A \) by rows and \( B \) by columns.
Alternatively, we can represent \( A \) by columns and \( B \) by rows. This leads to a trickier interpretation of \( AB \) as a sum of outer products. Symbolically, \[ C = AB = \begin{bmatrix} | & | & & | \\ \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_n \\ | & | & & | \end{bmatrix} \begin{bmatrix} \text{---} & \mathbf{b}_1^T & \text{---} \\ \text{---} & \mathbf{b}_2^T & \text{---} \\ & \vdots & \\ \text{---} & \mathbf{b}_n^T & \text{---} \end{bmatrix} = \sum_{i=1}^n \mathbf{a}_i\mathbf{b}_i^T. \] Put another way, \( AB \) is equal to the sum, over all \( i \), of the outer product of the \( i \)-th column of \( A \) and the \( i \)-th row of \( B \). Since \( \mathbf{a}_i \in \mathbb{R}^m \) and \( \mathbf{b}_i \in \mathbb{R}^p \), the dimension of the outer product \( \mathbf{a}_i\mathbf{b}_i^T \) is \( m \times p \), which coincides with the dimension of \( C \).
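Here is a short NumPy sketch (random matrices, chosen only for illustration) verifying that the sum of outer products reproduces \( AB \):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))        # A in R^{m x n} with m = 4, n = 3
B = rng.standard_normal((3, 5))        # B in R^{n x p} with p = 5

# Sum over i of the outer product of the i-th column of A with the i-th row of B.
C_outer = sum(np.outer(A[:, i], B[i, :]) for i in range(A.shape[1]))

print(np.allclose(C_outer, A @ B))     # True
```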
3. Operations and Properties
In this section, we present several operations and properties of matrices and vectors. Hopefully, much of this will be a review for you, so these notes can serve as a reference for these topics.
3.1 The Identity Matrix and Diagonal Matrices
The identity matrix, denoted \( I \in \mathbb{R}^{n \times n} \), is a square matrix with ones on the diagonal and zeros everywhere else. That is, \[ I_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}. \] It has the property that for all \( A \in \mathbb{R}^{m \times n} \), \[ AI = A = IA. \] Note that the notation for the identity matrix is somewhat ambiguous since it does not specify its dimension. Generally, the dimensions of \( I \) are inferred from context to ensure matrix multiplication is valid. For example, in the equation above, the \( I \) in \( AI = A \) is an \( n \times n \) matrix, whereas the \( I \) in \( A = IA \) is an \( m \times m \) matrix.
A diagonal matrix is a matrix where all non-diagonal elements are zero. This is typically denoted \( D = \text{diag}(d_1, d_2, \ldots, d_n) \), with \[ D_{ij} = \begin{cases} d_i & i = j \\ 0 & i \neq j \end{cases}. \] Clearly, \( I = \text{diag}(1, 1, \ldots, 1) \).
3.2 The Transpose
The transpose of a matrix results from "flipping" the rows and columns. Given a matrix \( A \in \mathbb{R}^{m \times n} \), its transpose, written \( A^T \in \mathbb{R}^{n \times m} \), is the \( n \times m \) matrix whose entries are given by \[ (A^T)_{ij} = A_{ji}. \] We have already been using the transpose when describing row vectors, since the transpose of a column vector is naturally a row vector.
The following properties of transposes are easily verified:
- \( (A^T)^T = A \)
- \( (AB)^T = B^TA^T \)
- \( (A + B)^T = A^T + B^T \)
3.3 Symmetric Matrices
A square matrix \( A \in \mathbb{R}^{n \times n} \) is symmetric if \( A = A^T \). It is anti-symmetric if \( A = -A^T \).
It is easy to show that for any matrix \( A \in \mathbb{R}^{n \times n} \), the matrix \( A + A^T \) is symmetric and the matrix \( A - A^T \) is anti-symmetric. From this, it follows that any square matrix \( A \in \mathbb{R}^{n \times n} \) can be represented as a sum of a symmetric matrix and an anti-symmetric matrix, since \[ A = \frac{1}{2}(A + A^T) + \frac{1}{2}(A - A^T), \] where the first term is symmetric and the second term is anti-symmetric.
Symmetric matrices occur frequently in practice and have many useful properties, which we will explore shortly. It is common to denote the set of all symmetric matrices of size \( n \) as \( S^n \), so that \( A \in S^n \) means that \( A \) is a symmetric \( n \times n \) matrix.
3.4 The Trace
The trace of a square matrix \( A \in \mathbb{R}^{n \times n} \), denoted \( \text{tr}(A) \) (or simply \( \text{tr}A \)), is the sum of its diagonal elements: \[ \text{tr}A = \sum_{i=1}^n A_{ii}. \]
As described in the CS229 lecture notes, the trace has the following properties (included here for completeness):
- For \( A \in \mathbb{R}^{n \times n} \), \( \text{tr}A = \text{tr}A^T \).
- For \( A, B \in \mathbb{R}^{n \times n} \), \( \text{tr}(A + B) = \text{tr}A + \text{tr}B \).
- For \( A \in \mathbb{R}^{n \times n} \), \( t \in \mathbb{R} \), \( \text{tr}(tA) = t \text{tr}A \).
- For \( A, B \) such that \( AB \) is square, \( \text{tr}(AB) = \text{tr}(BA) \).
- For \( A, B, C \) such that \( ABC \) is square, \( \text{tr}(ABC) = \text{tr}(BCA) = \text{tr}(CAB) \), and similarly for products of more matrices.
As an example of how these properties can be proven, consider the fourth property: \( \text{tr}(AB) = \text{tr}(BA) \). Suppose \( A \in \mathbb{R}^{m \times n} \) and \( B \in \mathbb{R}^{n \times m} \) (so that \( AB \in \mathbb{R}^{m \times m} \) is square). Observe that \( BA \in \mathbb{R}^{n \times n} \) is also square, so it makes sense to apply the trace operator to it. To verify that \( \text{tr}(AB) = \text{tr}(BA) \), note that \[ \text{tr}(AB) = \sum_{i=1}^m (AB)_{ii} = \sum_{i=1}^m \left( \sum_{j=1}^n A_{ij}B_{ji} \right) = \sum_{i=1}^m \sum_{j=1}^n A_{ij}B_{ji}. \] Using the commutativity of scalar multiplication and associativity of scalar addition, we can rearrange the terms: \[ \sum_{i=1}^m \sum_{j=1}^n A_{ij}B_{ji} = \sum_{j=1}^n \sum_{i=1}^m B_{ji}A_{ij} = \sum_{j=1}^n (BA)_{jj} = \text{tr}(BA). \]
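A quick numerical illustration of the cyclic property (a NumPy sketch with arbitrary random matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 3))

# AB is 3 x 3 and BA is 5 x 5, yet their traces agree.
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))   # True
```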
3.5 Norms
A norm of a vector \( \| \mathbf{x} \| \) is informally a measure of the "length" of the vector. For example, the commonly-used Euclidean or \( \ell_2 \) norm is defined as \[ \| \mathbf{x} \|_2 = \sqrt{\sum_{i=1}^n x_i^2}. \] Note that \( \| \mathbf{x} \|_2^2 = \mathbf{x}^T \mathbf{x} \).
More formally, a norm is any function \( f : \mathbb{R}^n \to \mathbb{R} \) that satisfies the following four properties:
- For all \( \mathbf{x} \in \mathbb{R}^n \), \( f(\mathbf{x}) \geq 0 \) (non-negativity).
- \( f(\mathbf{x}) = 0 \) if and only if \( \mathbf{x} = \mathbf{0} \) (definiteness).
- For all \( \mathbf{x} \in \mathbb{R}^n \), \( t \in \mathbb{R} \), \( f(t\mathbf{x}) = |t|f(\mathbf{x}) \) (homogeneity).
- For all \( \mathbf{x}, \mathbf{y} \in \mathbb{R}^n \), \( f(\mathbf{x} + \mathbf{y}) \leq f(\mathbf{x}) + f(\mathbf{y}) \) (triangle inequality).
Other examples of norms include the \( \ell_1 \) norm, \[ \| \mathbf{x} \|_1 = \sum_{i=1}^n |x_i|, \] and the \( \ell_\infty \) norm, \[ \| \mathbf{x} \|_\infty = \max_i |x_i|. \]
In fact, all three norms presented so far are examples of the family of \( \ell_p \) norms, parameterized by a real number \( p \geq 1 \), and defined as \[ \| \mathbf{x} \|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}. \]
Norms can also be defined for matrices, such as the Frobenius norm, \[ \| A \|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n A_{ij}^2} = \sqrt{\text{tr}(A^T A)}. \]
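The following NumPy sketch (arbitrary example values, and NumPy's norm routines are just one convenient way to evaluate the formulas above) computes each of these norms:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
print(np.linalg.norm(x, 1))        # l1 norm: 7.0
print(np.linalg.norm(x, 2))        # l2 norm: 5.0
print(np.linalg.norm(x, np.inf))   # l-infinity norm: 4.0

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
# Frobenius norm two ways: the entrywise formula and the trace formula.
print(np.linalg.norm(A, 'fro'))
print(np.sqrt(np.trace(A.T @ A)))  # same value, sqrt(30)
```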
3.6 Linear Independence and Rank
A set of vectors \( \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\} \subset \mathbb{R}^m \) is said to be (linearly) independent if no vector can be represented as a linear combination of the remaining vectors. Conversely, if one vector belonging to the set can be represented as a linear combination of the remaining vectors, then the vectors are said to be (linearly) dependent. That is, if \[ \mathbf{x}_n = \sum_{i=1}^{n-1} \alpha_i \mathbf{x}_i \] for some scalar values \( \alpha_1, \ldots, \alpha_{n-1} \in \mathbb{R} \), then the vectors \( \mathbf{x}_1, \ldots, \mathbf{x}_n \) are linearly dependent; otherwise, they are linearly independent.
For example, the vectors \[ \mathbf{x}_1 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad \mathbf{x}_2 = \begin{bmatrix} 4 \\ 1 \\ 5 \end{bmatrix}, \quad \mathbf{x}_3 = \begin{bmatrix} 2 \\ -3 \\ -1 \end{bmatrix} \] are linearly dependent because \( \mathbf{x}_3 = -2\mathbf{x}_1 + \mathbf{x}_2 \).
The column rank of a matrix \( A \in \mathbb{R}^{m \times n} \) is the size of the largest subset of columns of \( A \) that constitute a linearly independent set. Similarly, the row rank is the largest number of rows of \( A \) that constitute a linearly independent set.
For any matrix \( A \in \mathbb{R}^{m \times n} \), the column rank of \( A \) is equal to the row rank of \( A \) (though we will not prove this), and both quantities are referred to collectively as the rank of \( A \), denoted \( \text{rank}(A) \). The following are some basic properties of the rank:
- For \( A \in \mathbb{R}^{m \times n} \), \( \text{rank}(A) \leq \min(m, n) \). If \( \text{rank}(A) = \min(m, n) \), then \( A \) is said to be full rank.
- For \( A \in \mathbb{R}^{m \times n} \), \( \text{rank}(A) = \text{rank}(A^T) \).
- For \( A \in \mathbb{R}^{m \times n} \), \( B \in \mathbb{R}^{n \times p} \), \( \text{rank}(AB) \leq \min(\text{rank}(A), \text{rank}(B)) \).
- For \( A, B \in \mathbb{R}^{m \times n} \), \( \text{rank}(A + B) \leq \text{rank}(A) + \text{rank}(B) \).
3.7 The Inverse
The inverse of a square matrix \( A \in \mathbb{R}^{n \times n} \) is denoted \( A^{-1} \), and is the unique matrix such that \[ A^{-1}A = I = AA^{-1}. \] Note that not all matrices have inverses. Non-square matrices, for example, do not have inverses by definition. However, for some square matrices \( A \), it may still be the case that \( A^{-1} \) does not exist. In particular, we say that \( A \) is invertible or non-singular if \( A^{-1} \) exists, and non-invertible or singular otherwise.
For a square matrix \( A \) to have an inverse \( A^{-1} \), \( A \) must be full rank. We will soon see that there are many alternative sufficient and necessary conditions, in addition to full rank, for invertibility.
The following are properties of the inverse; all assume that \( A, B \in \mathbb{R}^{n \times n} \) are non-singular:
- \( (A^{-1})^{-1} = A \)
- \( (AB)^{-1} = B^{-1}A^{-1} \)
- \( (A^{-1})^T = (A^T)^{-1} \). For this reason, this matrix is often denoted \( A^{-T} \).
As an example of how the inverse is used, consider the linear system of equations \( A\mathbf{x} = \mathbf{b} \), where \( A \in \mathbb{R}^{n \times n} \), and \( \mathbf{x}, \mathbf{b} \in \mathbb{R}^n \). If \( A \) is nonsingular (i.e., invertible), then \( \mathbf{x} = A^{-1}\mathbf{b} \). (What if \( A \in \mathbb{R}^{m \times n} \) is not a square matrix? Does this work?)
3.8 Orthogonal Matrices
Two vectors \( \mathbf{x}, \mathbf{y} \in \mathbb{R}^n \) are orthogonal if \( \mathbf{x}^T\mathbf{y} = 0 \). A vector \( \mathbf{x} \in \mathbb{R}^n \) is normalized if \( \| \mathbf{x} \|_2 = 1 \). A square matrix \( U \in \mathbb{R}^{n \times n} \) is orthogonal (note the different meanings when talking about vectors versus matrices) if all its columns are orthogonal to each other and are normalized (the columns are then referred to as being orthonormal).
It follows immediately from the definition of orthogonality and normality that \[ U^T U = I = U U^T. \] In other words, the inverse of an orthogonal matrix is its transpose. Note that if \( U \) is not square (i.e., \( U \in \mathbb{R}^{m \times n} \) with \( n < m \)) but its columns are still orthonormal, then \( U^T U = I \), but \( U U^T \neq I \). We generally only use the term orthogonal to describe the previous case, where \( U \) is square.
Another nice property of orthogonal matrices is that operating on a vector with an orthogonal matrix will not change its Euclidean norm, i.e., \[ \| U\mathbf{x} \|_2 = \| \mathbf{x} \|_2 \] for any \( \mathbf{x} \in \mathbb{R}^n \), \( U \in \mathbb{R}^{n \times n} \) orthogonal.
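For instance (a small NumPy sketch; the rotation matrix is an arbitrary choice of orthogonal matrix):

```python
import numpy as np

theta = 0.3
# A 2 x 2 rotation matrix: its columns are orthonormal, so it is orthogonal.
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(U.T @ U, np.eye(2)))            # U^T U = I

x = np.array([3.0, 4.0])
# Multiplying by U leaves the Euclidean norm unchanged.
print(np.linalg.norm(U @ x), np.linalg.norm(x))   # both 5.0
```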
3.9 Range and Nullspace of a Matrix
The span of a set of vectors \( \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\} \) is the set of all vectors that can be expressed as a linear combination of \( \{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \). That is, \[ \text{span}(\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}) = \left\{ \mathbf{v} : \mathbf{v} = \sum_{i=1}^n \alpha_i \mathbf{x}_i, \alpha_i \in \mathbb{R} \right\}. \]
It can be shown that if \( \{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \) is a set of \( n \) linearly independent vectors, where each \( \mathbf{x}_i \in \mathbb{R}^n \), then \( \text{span}(\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}) = \mathbb{R}^n \). In other words, any vector \( \mathbf{v} \in \mathbb{R}^n \) can be written as a linear combination of \( \mathbf{x}_1 \) through \( \mathbf{x}_n \).
The projection of a vector \( \mathbf{y} \in \mathbb{R}^m \) onto the span of \( \{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \) (here we assume \( \mathbf{x}_i \in \mathbb{R}^m \)) is the vector \( \mathbf{v} \in \text{span}(\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}) \), such that \( \mathbf{v} \) is as close as possible to \( \mathbf{y} \), as measured by the Euclidean norm \( \| \mathbf{v} - \mathbf{y} \|_2 \). We denote the projection as \( \text{Proj}(\mathbf{y}; \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}) \) and can define it formally as, \[ \text{Proj}(\mathbf{y}; \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}) = \arg\min_{\mathbf{v} \in \text{span}(\{\mathbf{x}_1, \ldots, \mathbf{x}_n\})} \| \mathbf{y} - \mathbf{v} \|_2. \]
The range (sometimes also called the columnspace) of a matrix \( A \in \mathbb{R}^{m \times n} \), denoted \( R(A) \), is the span of the columns of \( A \). In other words, \[ R(A) = \{\mathbf{v} \in \mathbb{R}^m : \mathbf{v} = A\mathbf{x}, \mathbf{x} \in \mathbb{R}^n\}. \]
Making a few technical assumptions (namely that \( A \) is full rank and that \( n < m \)), the projection of a vector \( \mathbf{y} \in \mathbb{R}^m \) onto the range of \( A \) is given by, \[ \text{Proj}(\mathbf{y}; A)=\arg\min_{\mathbf{v} \in R(A)} \| \mathbf{v} - \mathbf{y} \|_2=A(A^T A)^{-1}A^T \mathbf{y}. \] This last equation should look extremely familiar, since it is almost the same formula we derived in class (and which we will soon derive again) for the least squares estimation of parameters.
When \( A \) contains only a single column, \( \mathbf{a} \in \mathbb{R}^m \), this gives the special case for a projection of a vector onto a line: \[ \text{Proj}(\mathbf{y}; \mathbf{a}) = \frac{\mathbf{a}\mathbf{a}^T}{\mathbf{a}^T \mathbf{a}} \mathbf{y}. \]
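A small NumPy sketch (random full-rank \( A \), chosen only for illustration) that forms this projection and checks that the residual is orthogonal to the columns of \( A \):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 2))        # full rank (with probability 1), n < m
y = rng.standard_normal(5)

# Projection of y onto R(A): A (A^T A)^{-1} A^T y.
v = A @ np.linalg.inv(A.T @ A) @ A.T @ y

# The residual y - v is orthogonal to every column of A.
print(np.allclose(A.T @ (y - v), 0))   # True
```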
The nullspace of a matrix \( A \in \mathbb{R}^{m \times n} \), denoted \( N(A) \), is the set of all vectors that equal \( \mathbf{0} \) when multiplied by \( A \), i.e., \[ N(A) = \{\mathbf{x} \in \mathbb{R}^n : A\mathbf{x} = \mathbf{0}\}. \] Note that vectors in \( R(A) \) are of size \( m \), while vectors in \( N(A) \) are of size \( n \), so vectors in \( R(A^T) \) and \( N(A) \) are both in \( \mathbb{R}^n \). It turns out that \[ \{\mathbf{w} : \mathbf{w} = \mathbf{u} + \mathbf{v}, \mathbf{u} \in R(A^T), \mathbf{v} \in N(A)\} = \mathbb{R}^n \] and \[ R(A^T) \cap N(A) = \{\mathbf{0}\}. \] In other words, \( R(A^T) \) and \( N(A) \) together span all of \( \mathbb{R}^n \) and intersect only at the zero vector. Sets of this type are called orthogonal complements, and we denote this \( R(A^T) = N(A)^\perp \).
3.10 The Determinant
The determinant of a square matrix \( A \in \mathbb{R}^{n \times n} \) is a function \( \det : \mathbb{R}^{n \times n} \to \mathbb{R} \), denoted \( |A| \) or \( \det A \) (like the trace operator, we usually omit parentheses). Algebraically, one could write down an explicit formula for the determinant of \( A \), but this unfortunately gives little intuition about its meaning. Instead, we'll start by providing a geometric interpretation of the determinant and then visit some of its specific algebraic properties afterwards.
Given a matrix \[ A = \begin{bmatrix} \text{---} & \mathbf{a}_1^T & \text{---} \\ \text{---} & \mathbf{a}_2^T & \text{---} \\ & \vdots & \\ \text{---} & \mathbf{a}_n^T & \text{---} \end{bmatrix}, \] consider the set of points \( S \subset \mathbb{R}^n \) formed by taking all possible linear combinations of the row vectors \( \mathbf{a}_1, \ldots, \mathbf{a}_n \in \mathbb{R}^n \) of \( A \), where the coefficients of the linear combination are all between 0 and 1; that is, the set \( S \) is the restriction of \( \text{span}(\{\mathbf{a}_1, \ldots, \mathbf{a}_n\}) \) to only those linear combinations whose coefficients \( \alpha_1, \ldots, \alpha_n \) satisfy \( 0 \leq \alpha_i \leq 1 \), \( i = 1, \ldots, n \). Formally, \[ S = \{\mathbf{v} \in \mathbb{R}^n : \mathbf{v} = \sum_{i=1}^n \alpha_i \mathbf{a}_i \text{ where } 0 \leq \alpha_i \leq 1, i = 1, \ldots, n\}. \]
The absolute value of the determinant of \( A \), it turns out, is a measure of the "volume" of the set \( S \). For example, consider the \( 2 \times 2 \) matrix \[ A = \begin{bmatrix} 1 & 3 \\ 3 & 2 \end{bmatrix}. \] Here, the rows of the matrix are \[ \mathbf{a}_1 = \begin{bmatrix} 1 \\ 3 \end{bmatrix}, \quad \mathbf{a}_2 = \begin{bmatrix} 3 \\ 2 \end{bmatrix}. \] For a \( 2 \times 2 \) matrix, the set \( S \) is the parallelogram with vertices \( \mathbf{0} \), \( \mathbf{a}_1 \), \( \mathbf{a}_2 \), and \( \mathbf{a}_1 + \mathbf{a}_2 \). In this example, \( |A| = 1 \cdot 2 - 3 \cdot 3 = -7 \), so the area of the parallelogram is 7.
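If you want to verify these numbers yourself, here is a minimal NumPy sketch (NumPy is our choice here, not prescribed by the notes) that also previews the row-scaling and row-swap properties stated below:

```python
import numpy as np

A = np.array([[1.0, 3.0],
              [3.0, 2.0]])

d = np.linalg.det(A)
print(d)            # -7.0 (up to floating-point error)
print(abs(d))       # 7.0: the area of the parallelogram spanned by the rows

# Scaling a row by t scales the determinant by t; swapping rows flips its sign.
A_scaled = A.copy(); A_scaled[0, :] *= 2.0
print(np.linalg.det(A_scaled))      # -14.0
print(np.linalg.det(A[::-1, :]))    # 7.0
```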
Algebraically, the determinant satisfies the following three properties (from which all other properties follow, including the general formula):
- The determinant of the identity is 1, \( |I| = 1 \). (Geometrically, the volume of a unit hypercube is 1).
- Given a matrix \( A \in \mathbb{R}^{n \times n} \), if we multiply a single row in \( A \) by a scalar \( t \in \mathbb{R} \), then the determinant of the new matrix is \( t|A| \).
- If we exchange any two rows \( \mathbf{a}_i^T \) and \( \mathbf{a}_j^T \) of \( A \), then the determinant of the new matrix is \( -|A| \).
Several properties that follow from the three properties above include:
- For \( A \in \mathbb{R}^{n \times n} \), \( |A| = |A^T| \).
- For \( A, B \in \mathbb{R}^{n \times n} \), \( |AB| = |A||B| \).
- For \( A \in \mathbb{R}^{n \times n} \), \( |A| = 0 \) if and only if \( A \) is singular (i.e., non-invertible).
- For \( A \in \mathbb{R}^{n \times n} \) and \( A \) non-singular, \( |A^{-1}| = 1/|A| \).
3.11 Quadratic Forms and Positive Semidefinite Matrices
Given a square matrix \( A \in \mathbb{R}^{n \times n} \) and a vector \( \mathbf{x} \in \mathbb{R}^n \), the scalar value \( \mathbf{x}^T A \mathbf{x} \) is called a quadratic form. Written explicitly, we see that \[ \mathbf{x}^T A \mathbf{x} = \sum_{i=1}^n x_i (A\mathbf{x})_i = \sum_{i=1}^n x_i \left( \sum_{j=1}^n A_{ij} x_j \right) = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j. \]
Note that, \[ \mathbf{x}^T A \mathbf{x} = (\mathbf{x}^T A \mathbf{x})^T = \mathbf{x}^T A^T \mathbf{x} = \mathbf{x}^T \left( \frac{1}{2} A + \frac{1}{2} A^T \right) \mathbf{x}, \] where the first equality follows from the fact that the transpose of a scalar is equal to itself, and the second equality follows from averaging two quantities that are themselves equal. From this, we can conclude that only the symmetric part of \( A \) contributes to the quadratic form. For this reason, we often implicitly assume that the matrices appearing in a quadratic form are symmetric.
We give the following definitions:
- A symmetric matrix \( A \in S^n \) is positive definite (PD) if for all non-zero vectors \( \mathbf{x} \in \mathbb{R}^n \), \( \mathbf{x}^T A \mathbf{x} > 0 \). This is usually denoted \( A \succ 0 \) (or just \( A > 0 \)), and the set of all positive definite matrices is denoted \( S^n_{++} \).
- A symmetric matrix \( A \in S^n \) is positive semidefinite (PSD) if for all vectors \( \mathbf{x} \), \( \mathbf{x}^T A \mathbf{x} \geq 0 \). This is written \( A \succeq 0 \) (or just \( A \geq 0 \)), and the set of all positive semidefinite matrices is often denoted \( S^n_+ \).
- Likewise, a symmetric matrix \( A \in S^n \) is negative definite (ND), denoted \( A \prec 0 \) (or just \( A < 0 \)) if for all non-zero \( \mathbf{x} \in \mathbb{R}^n \), \( \mathbf{x}^T A \mathbf{x} < 0 \).
- Similarly, a symmetric matrix \( A \in S^n \) is negative semidefinite (NSD), denoted \( A \preceq 0 \) (or just \( A \leq 0 \)) if for all \( \mathbf{x} \in \mathbb{R}^n \), \( \mathbf{x}^T A \mathbf{x} \leq 0 \).
- Finally, a symmetric matrix \( A \in S^n \) is indefinite if it is neither positive semidefinite nor negative semidefinite, i.e., if there exist \( \mathbf{x}_1, \mathbf{x}_2 \in \mathbb{R}^n \) such that \( \mathbf{x}_1^T A \mathbf{x}_1 > 0 \) and \( \mathbf{x}_2^T A \mathbf{x}_2 < 0 \).
It should be obvious that if \( A \) is positive definite, then \( -A \) is negative definite and vice versa. Likewise, if \( A \) is positive semidefinite, then \( -A \) is negative semidefinite and vice versa. If \( A \) is indefinite, then so is \( -A \).
One important property of positive definite and negative definite matrices is that they are always full rank, and hence invertible. To see why this is the case, suppose that some matrix \( A \in \mathbb{R}^{n \times n} \) is not full rank. Then the \( j \)-th column of \( A \), for some \( j \), is expressible as a linear combination of the other \( n - 1 \) columns: \[ \mathbf{a}_j = \sum_{i \neq j} x_i \mathbf{a}_i, \] for some \( x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n \in \mathbb{R} \). Setting \( x_j = -1 \), we have \[ A\mathbf{x} = \sum_{i=1}^n x_i \mathbf{a}_i = \mathbf{0}. \] But this implies \( \mathbf{x}^T A \mathbf{x} = 0 \) for some non-zero vector \( \mathbf{x} \), so \( A \) can be neither positive definite nor negative definite. Therefore, if \( A \) is either positive definite or negative definite, it must be full rank.
Finally, there is one type of positive definite matrix that comes up frequently, and so deserves special mention. Given any matrix \( A \in \mathbb{R}^{m \times n} \) (not necessarily symmetric or even square), the matrix \( G = A^T A \) (sometimes called a Gram matrix) is always positive semidefinite. Further, if \( m \geq n \) (and we assume for convenience that \( A \) is full rank), then \( G = A^T A \) is positive definite.
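A short NumPy sketch (random \( A \), for illustration only) confirming that the Gram matrix is positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3))      # m >= n; full rank with probability 1
G = A.T @ A                          # the Gram matrix

# The eigenvalues of G are nonnegative (here strictly positive, so G is PD).
print(np.linalg.eigvalsh(G))

# Directly: x^T G x = ||Ax||_2^2 >= 0 for any x.
x = rng.standard_normal(3)
print(np.isclose(x @ G @ x, np.linalg.norm(A @ x) ** 2))   # True
```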
3.12 Eigenvalues and Eigenvectors
Given a square matrix \( A \in \mathbb{R}^{n \times n} \), we say that \( \lambda \in \mathbb{C} \) is an eigenvalue of \( A \) and \( \mathbf{x} \in \mathbb{C}^n \) is the corresponding eigenvector if \[ A\mathbf{x} = \lambda\mathbf{x}, \quad \mathbf{x} \neq \mathbf{0}. \] Intuitively, this definition means that multiplying \( A \) by the vector \( \mathbf{x} \) results in a new vector that points in the same direction as \( \mathbf{x} \), but scaled by a factor \( \lambda \). Also note that for any eigenvector \( \mathbf{x} \in \mathbb{C}^n \) and non-zero scalar \( c \in \mathbb{C} \), \[ A(c\mathbf{x}) = cA\mathbf{x} = c\lambda\mathbf{x} = \lambda(c\mathbf{x}), \] so \( c\mathbf{x} \) is also an eigenvector. For this reason, when we talk about "the" eigenvector associated with \( \lambda \), we usually assume that the eigenvector is normalized to have length 1 (this still creates some ambiguity, since \( \mathbf{x} \) and \( -\mathbf{x} \) will both be eigenvectors, but we will have to live with this).
We can rewrite the equation above to state that \( (\lambda, \mathbf{x}) \) is an eigenvalue-eigenvector pair of \( A \) if, \[ (\lambda I - A)\mathbf{x} = \mathbf{0}, \quad \mathbf{x} \neq \mathbf{0}. \] But \( (\lambda I - A)\mathbf{x} = \mathbf{0} \) has a non-zero solution for \( \mathbf{x} \) if and only if \( (\lambda I - A) \) has a non-empty nullspace, which is only the case if \( (\lambda I - A) \) is singular, i.e., \[ |(\lambda I - A)| = 0. \]
We can now use the previous definition of the determinant to expand this expression into a (very large) polynomial in \( \lambda \) of degree \( n \). We then find the \( n \) (possibly complex) roots of this polynomial to determine the \( n \) eigenvalues \( \lambda_1, \ldots, \lambda_n \). To find the eigenvector corresponding to the eigenvalue \( \lambda_i \), we simply solve the linear equation \( (\lambda_i I - A)\mathbf{x} = \mathbf{0} \). It should be noted that this is not the method actually used in practice to numerically compute the eigenvalues and eigenvectors (remember that the complete expansion of the determinant has \( n! \) terms); it is rather a mathematical argument.
The following are properties of eigenvalues and eigenvectors (in all cases assume \( A \in \mathbb{R}^{n \times n} \) has eigenvalues \( \lambda_1, \ldots, \lambda_n \) and associated eigenvectors \( \mathbf{x}_1, \ldots, \mathbf{x}_n \)):
- The trace of \( A \) is equal to the sum of its eigenvalues, \[ \text{tr}(A) = \sum_{i=1}^n \lambda_i. \]
- The determinant of \( A \) is equal to the product of its eigenvalues, \[ |A| = \prod_{i=1}^n \lambda_i. \]
- The rank of \( A \) is equal to the number of non-zero eigenvalues of \( A \) (this holds, in particular, whenever \( A \) is diagonalizable, as defined below).
- If \( A \) is non-singular, then \( 1/\lambda_i \) is an eigenvalue of \( A^{-1} \) with associated eigenvector \( \mathbf{x}_i \), i.e., \[ A^{-1}\mathbf{x}_i = (1/\lambda_i)\mathbf{x}_i. \] (To prove this, take the eigenvector equation \( A\mathbf{x}_i = \lambda_i\mathbf{x}_i \) and left-multiply each side by \( A^{-1} \).)
- The eigenvalues of a diagonal matrix \( D = \text{diag}(d_1, \ldots, d_n) \) are just the diagonal entries \( d_1, \ldots, d_n \).
We can write all the eigenvector equations simultaneously as \[ AX = X\Lambda, \] where the columns of \( X \in \mathbb{R}^{n \times n} \) are the eigenvectors of \( A \) and \( \Lambda \) is a diagonal matrix whose entries are the eigenvalues of \( A \), i.e., \[ X = \begin{bmatrix} | & | & & | \\ \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_n \\ | & | & & | \end{bmatrix}, \quad \Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n). \] If the eigenvectors of \( A \) are linearly independent, then the matrix \( X \) will be invertible, so \[ A = X\Lambda X^{-1}. \] A matrix that can be written in this form is called diagonalizable.
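A small NumPy sketch (the \( 2 \times 2 \) matrix is an arbitrary example) illustrating the decomposition \( A = X \Lambda X^{-1} \) and two of the properties listed above:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])            # eigenvalues 5 and 2

# np.linalg.eig returns the eigenvalues and a matrix X whose columns are eigenvectors.
lam, X = np.linalg.eig(A)
Lam = np.diag(lam)

print(np.allclose(A @ X, X @ Lam))                    # A X = X Lambda
print(np.allclose(A, X @ Lam @ np.linalg.inv(X)))     # A = X Lambda X^{-1}
print(np.isclose(np.trace(A), lam.sum()))             # tr(A) = sum of eigenvalues
print(np.isclose(np.linalg.det(A), lam.prod()))       # |A| = product of eigenvalues
```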
3.13 Eigenvalues and Eigenvectors of Symmetric Matrices
Two remarkable properties arise when we examine the eigenvalues and eigenvectors of a symmetric matrix \( A \in S^n \). First, it can be shown that all the eigenvalues of \( A \) are real. Secondly, the eigenvectors of \( A \) can be chosen to be orthonormal, i.e., the matrix \( X \) defined earlier becomes an orthogonal matrix (for this reason, we denote the matrix of eigenvectors as \( U \) in this case).
We can therefore represent \( A \) as \[ A = U \Lambda U^T, \] remembering from above that the inverse of an orthogonal matrix is just its transpose.
Using this, we can show that the definiteness of a matrix depends entirely on the sign of its eigenvalues. Suppose \( A \in S^n \) has the eigendecomposition \( A = U \Lambda U^T \). Then \[ \mathbf{x}^T A \mathbf{x} = \mathbf{x}^T U \Lambda U^T \mathbf{x} = \mathbf{y}^T \Lambda \mathbf{y} = \sum_{i=1}^n \lambda_i y_i^2, \] where \( \mathbf{y} = U^T \mathbf{x} \) (and since \( U \) is full rank, any vector \( \mathbf{y} \in \mathbb{R}^n \) can be represented in this form). Because \( y_i^2 \) is always non-negative, the sign of this expression depends entirely on the \( \lambda_i \)'s:
- If all \( \lambda_i > 0 \), then the matrix is positive definite.
- If all \( \lambda_i \geq 0 \), it is positive semidefinite.
- Likewise, if all \( \lambda_i < 0 \) or \( \lambda_i \leq 0 \), then \( A \) is negative definite or negative semidefinite, respectively.
- Finally, if \( A \) has both positive and negative eigenvalues, it is indefinite.
An application where eigenvalues and eigenvectors come up frequently is in maximizing some function of a matrix. In particular, for a matrix \( A \in S^n \), consider the following maximization problem: \[ \max_{\mathbf{x} \in \mathbb{R}^n} \mathbf{x}^T A \mathbf{x} \quad \text{subject to} \quad \| \mathbf{x} \|_2^2 = 1, \] i.e., we want to find the vector (of norm 1) which maximizes the quadratic form. Assuming the eigenvalues are ordered as \( \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n \), the optimal \( \mathbf{x} \) for this optimization problem is \( \mathbf{x}_1 \), the eigenvector corresponding to \( \lambda_1 \). In this case, the maximal value of the quadratic form is \( \lambda_1 \).
Similarly, the optimal solution to the minimization problem, \[ \min_{\mathbf{x} \in \mathbb{R}^n} \mathbf{x}^T A \mathbf{x} \quad \text{subject to} \quad \| \mathbf{x} \|_2^2 = 1, \] is \( \mathbf{x}_n \), the eigenvector corresponding to \( \lambda_n \), and the minimal value is \( \lambda_n \). This can be proved by appealing to the eigenvector-eigenvalue form of \( A \) and the properties of orthogonal matrices. However, in the next section, we will see a way of showing it directly using matrix calculus.
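A NumPy sketch (random symmetric \( A \), purely illustrative) confirming that the extreme values of the constrained quadratic form are the extreme eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2.0                    # an arbitrary symmetric matrix

# For symmetric matrices, eigh returns eigenvalues in ascending order
# and orthonormal eigenvectors as the columns of U.
lam, U = np.linalg.eigh(A)
x_min, x_max = U[:, 0], U[:, -1]

print(np.isclose(x_max @ A @ x_max, lam[-1]))   # max of the quadratic form = largest eigenvalue
print(np.isclose(x_min @ A @ x_min, lam[0]))    # min of the quadratic form = smallest eigenvalue

# Any other unit vector stays between the two extremes.
z = rng.standard_normal(4); z /= np.linalg.norm(z)
print(lam[0] <= z @ A @ z <= lam[-1])           # True
```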
4. Matrix Calculus
While the topics in the previous sections are typically covered in a standard course on linear algebra, one topic that does not seem to be covered very often (and which we will use extensively) is the extension of calculus to the vector setting. Despite the fact that all the actual calculus we use is relatively trivial, the notation can often make things look much more difficult than they are. In this section, we present some basic definitions of matrix calculus and provide a few examples.
4.1 The Gradient
Suppose that \( f : \mathbb{R}^{m \times n} \to \mathbb{R} \) is a function that takes as input a matrix \( A \) of size \( m \times n \) and returns a real value. Then the gradient of \( f \) (with respect to \( A \in \mathbb{R}^{m \times n} \)) is the matrix of partial derivatives, defined as: \[ \nabla_A f(A) \in \mathbb{R}^{m \times n} = \begin{bmatrix} \frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{12}} & \cdots & \frac{\partial f(A)}{\partial A_{1n}} \\ \frac{\partial f(A)}{\partial A_{21}} & \frac{\partial f(A)}{\partial A_{22}} & \cdots & \frac{\partial f(A)}{\partial A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f(A)}{\partial A_{m1}} & \frac{\partial f(A)}{\partial A_{m2}} & \cdots & \frac{\partial f(A)}{\partial A_{mn}} \end{bmatrix}, \] i.e., an \( m \times n \) matrix with \[ (\nabla_A f(A))_{ij} = \frac{\partial f(A)}{\partial A_{ij}}. \]
Note that the size of \( \nabla_A f(A) \) is always the same as the size of \( A \). So if, in particular, \( A \) is just a vector \( \mathbf{x} \in \mathbb{R}^n \), \[ \nabla_{\mathbf{x}} f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f(\mathbf{x})}{\partial x_1} \\ \frac{\partial f(\mathbf{x})}{\partial x_2} \\ \vdots \\ \frac{\partial f(\mathbf{x})}{\partial x_n} \end{bmatrix}. \]
It is very important to remember that the gradient of a function is only defined if the function is real-valued, that is, if it returns a scalar value. We cannot, for example, take the gradient of \( A\mathbf{x} \), \( A \in \mathbb{R}^{n \times n} \) with respect to \( \mathbf{x} \), since this quantity is vector-valued.
It follows directly from the equivalent properties of partial derivatives that:
- \( \nabla_{\mathbf{x}}(f(\mathbf{x}) + g(\mathbf{x})) = \nabla_{\mathbf{x}} f(\mathbf{x}) + \nabla_{\mathbf{x}} g(\mathbf{x}) \).
- For \( t \in \mathbb{R} \), \( \nabla_{\mathbf{x}}(t f(\mathbf{x})) = t \nabla_{\mathbf{x}} f(\mathbf{x}) \).
In principle, gradients are a natural extension of partial derivatives to functions of multiple variables. In practice, however, working with gradients can sometimes be tricky for notational reasons. For example, suppose that \( A \in \mathbb{R}^{m \times n} \) is a matrix of fixed coefficients and suppose that \( \mathbf{b} \in \mathbb{R}^m \) is a vector of fixed coefficients. Let \( f : \mathbb{R}^m \to \mathbb{R} \) be the function defined by \( f(\mathbf{z}) = \mathbf{z}^T \mathbf{z} \), such that \( \nabla_{\mathbf{z}} f(\mathbf{z}) = 2\mathbf{z} \). But now, consider the expression, \[ \nabla f(A\mathbf{x}). \] How should this expression be interpreted? There are at least two possibilities:
- In the first interpretation, recall that \( \nabla_{\mathbf{z}} f(\mathbf{z}) = 2\mathbf{z} \). Here, we interpret \( \nabla f(A\mathbf{x}) \) as evaluating the gradient at the point \( A\mathbf{x} \), hence, \[ \nabla f(A\mathbf{x}) = 2(A\mathbf{x}) = 2A\mathbf{x} \in \mathbb{R}^m. \]
- In the second interpretation, we consider the quantity \( f(A\mathbf{x}) \) as a function of the input variables \( \mathbf{x} \). More formally, let \( g(\mathbf{x}) = f(A\mathbf{x}) \). Then in this interpretation, \[ \nabla f(A\mathbf{x}) = \nabla_{\mathbf{x}} g(\mathbf{x}) \in \mathbb{R}^n. \]
Here, we can see that these two interpretations are indeed different. One interpretation yields an \( m \)-dimensional vector as a result, while the other interpretation yields an \( n \)-dimensional vector as a result! How can we resolve this?
Here, the key is to make explicit the variables with respect to which we are differentiating. In the first case, we are differentiating the function \( f \) with respect to its arguments \( \mathbf{z} \) and then substituting the argument \( A\mathbf{x} \). In the second case, we are differentiating the composite function \( g(\mathbf{x}) = f(A\mathbf{x}) \) with respect to \( \mathbf{x} \) directly. We denote the first case as \( \nabla_{\mathbf{z}} f(A\mathbf{x}) \) and the second case as \( \nabla_{\mathbf{x}} f(A\mathbf{x}) \). Keeping the notation clear is extremely important (as you'll find out in your homework, in fact!).
4.2 The Hessian
Suppose that \( f : \mathbb{R}^n \to \mathbb{R} \) is a function that takes a vector in \( \mathbb{R}^n \) and returns a real number. Then the Hessian matrix with respect to \( \mathbf{x} \), written \( \nabla^2_{\mathbf{x}} f(\mathbf{x}) \) or simply as \( H \), is the \( n \times n \) matrix of partial derivatives, \[ \nabla^2_{\mathbf{x}} f(\mathbf{x}) \in \mathbb{R}^{n \times n} = \begin{bmatrix} \frac{\partial^2 f(\mathbf{x})}{\partial x_1^2} & \frac{\partial^2 f(\mathbf{x})}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f(\mathbf{x})}{\partial x_2 \partial x_1} & \frac{\partial^2 f(\mathbf{x})}{\partial x_2^2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(\mathbf{x})}{\partial x_n \partial x_1} & \frac{\partial^2 f(\mathbf{x})}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_n^2} \end{bmatrix}. \] In other words, \( \nabla^2_{\mathbf{x}} f(\mathbf{x}) \in \mathbb{R}^{n \times n} \), with \[ (\nabla^2_{\mathbf{x}} f(\mathbf{x}))_{ij} = \frac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_j}. \]
Note that the Hessian is always symmetric, since \[ \frac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_j} = \frac{\partial^2 f(\mathbf{x})}{\partial x_j \partial x_i}. \] Similar to the gradient, the Hessian is defined only when \( f(\mathbf{x}) \) is real-valued.
It is natural to think of the gradient as the analogue of the first derivative for functions of vectors, and the Hessian as the analogue of the second derivative (and the symbols we use also suggest this relation). This intuition is generally correct, but there are a few caveats to keep in mind.
First, for real-valued functions of one variable \( f : \mathbb{R} \to \mathbb{R} \), it is a basic definition that the second derivative is the derivative of the first derivative, i.e., \[ \frac{\partial^2 f(x)}{\partial x^2} = \frac{\partial}{\partial x} \left( \frac{\partial}{\partial x} f(x) \right). \] However, for functions of a vector, the gradient of the function is a vector, and we cannot take the gradient of a vector, i.e., \[ \nabla_{\mathbf{x}} \nabla_{\mathbf{x}} f(\mathbf{x}) = \nabla_{\mathbf{x}} \begin{bmatrix} \frac{\partial f(\mathbf{x})}{\partial x_1} \\ \frac{\partial f(\mathbf{x})}{\partial x_2} \\ \vdots \\ \frac{\partial f(\mathbf{x})}{\partial x_n} \end{bmatrix}, \] and this expression is not defined. Therefore, it is not the case that the Hessian is the gradient of the gradient. However, this is almost true, in the following sense: If we look at the \( i \)-th entry of the gradient \( (\nabla_{\mathbf{x}} f(\mathbf{x}))_i = \frac{\partial f(\mathbf{x})}{\partial x_i} \), and take the gradient with respect to \( \mathbf{x} \), we get \[ \nabla_{\mathbf{x}} \frac{\partial f(\mathbf{x})}{\partial x_i} = \begin{bmatrix} \frac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_1} \\ \frac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_2} \\ \vdots \\ \frac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_n} \end{bmatrix}, \] which is the \( i \)-th column (or row) of the Hessian. Therefore, \[ \nabla^2_{\mathbf{x}} f(\mathbf{x}) = [\nabla_{\mathbf{x}} (\nabla_{\mathbf{x}} f(\mathbf{x}))_1 \quad \nabla_{\mathbf{x}} (\nabla_{\mathbf{x}} f(\mathbf{x}))_2 \quad \cdots \quad \nabla_{\mathbf{x}} (\nabla_{\mathbf{x}} f(\mathbf{x}))_n]. \] If we don't mind being a little bit sloppy, we can say that (essentially) \[ \nabla^2_{\mathbf{x}} f(\mathbf{x}) = \nabla_{\mathbf{x}} (\nabla_{\mathbf{x}} f(\mathbf{x}))^T, \] so long as we understand that this really means taking the gradient of each entry of \( (\nabla_{\mathbf{x}} f(\mathbf{x}))^T \), not the gradient of the whole vector.
4.3 Gradients and Hessians of Quadratic and Linear Functions
For \( \mathbf{x} \in \mathbb{R}^n \), let \( f(\mathbf{x}) = \mathbf{b}^T \mathbf{x} \) for some known vector \( \mathbf{b} \in \mathbb{R}^n \). Then \[ f(\mathbf{x}) = \sum_{i=1}^n b_i x_i, \] so \[ \frac{\partial f(\mathbf{x})}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^n b_i x_i = b_k. \] From this, we can easily see that \[ \nabla_{\mathbf{x}} \mathbf{b}^T \mathbf{x} = \mathbf{b}. \] This should be compared to the analogous situation in single-variable calculus, where \( \frac{\partial}{\partial x} ax = a \).
Now consider the quadratic function \( f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x} \) for \( A \in S^n \). Remember that \[ f(\mathbf{x}) = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j. \] To take the partial derivative, we'll consider separately the terms that contain \( x_k \) as one factor and the term that contains \( x_k^2 \): \[ \frac{\partial f(\mathbf{x})}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j = \sum_{i=1}^n A_{ik} x_i + \sum_{j=1}^n A_{kj} x_j = 2 \sum_{i=1}^n A_{ki} x_i, \] where the last equality follows since \( A \) is symmetric. Note that the \( k \)-th entry of \( \nabla_{\mathbf{x}} f(\mathbf{x}) \) is just the inner product of the \( k \)-th row of \( A \) and \( \mathbf{x} \). Therefore, \[ \nabla_{\mathbf{x}} \mathbf{x}^T A \mathbf{x} = 2A\mathbf{x}. \] Again, this should remind you of the analogous fact in single-variable calculus, that \( \frac{\partial}{\partial x} ax^2 = 2ax \).
Finally, let's look at the Hessian of the quadratic function \( f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x} \) (it should be obvious that the Hessian of a linear function \( \mathbf{b}^T \mathbf{x} \) is zero). In this case, \[ \frac{\partial^2 f(\mathbf{x})}{\partial x_k \partial x_\ell} = \frac{\partial}{\partial x_k} \left( \frac{\partial f(\mathbf{x})}{\partial x_\ell} \right) = 2A_{k\ell}. \] Therefore, it should be clear that \[ \nabla^2_{\mathbf{x}} \mathbf{x}^T A \mathbf{x} = 2A, \] which should be entirely expected (and again analogous to the single-variable fact that \( \frac{\partial^2}{\partial x^2} ax^2 = 2a \)).
To recap:
- \( \nabla_{\mathbf{x}} \mathbf{b}^T \mathbf{x} = \mathbf{b} \)
- \( \nabla_{\mathbf{x}} \mathbf{x}^T A \mathbf{x} = 2A\mathbf{x} \) (if \( A \) symmetric)
- \( \nabla^2_{\mathbf{x}} \mathbf{x}^T A \mathbf{x} = 2A \) (if \( A \) symmetric)
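These identities are easy to check numerically. The sketch below (our own finite-difference helper, with arbitrary random data, chosen purely for illustration) compares the closed-form gradients against central differences:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
b = rng.standard_normal(n)
B = rng.standard_normal((n, n))
A = (B + B.T) / 2.0                      # symmetric, as the identities assume
x = rng.standard_normal(n)

def num_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

print(np.allclose(num_grad(lambda z: b @ z, x), b))              # grad of b^T x is b
print(np.allclose(num_grad(lambda z: z @ A @ z, x), 2 * A @ x))  # grad of x^T A x is 2Ax
```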
4.4 Least Squares
Let's apply the equations we obtained in the last section to derive the least squares equations. Suppose we are given a matrix \( A \in \mathbb{R}^{m \times n} \) (for simplicity we assume \( A \) is full rank) and a vector \( \mathbf{b} \in \mathbb{R}^m \) such that \( \mathbf{b} \notin R(A) \). In this situation, we will not be able to find a vector \( \mathbf{x} \in \mathbb{R}^n \) such that \( A\mathbf{x} = \mathbf{b} \), so instead we want to find a vector \( \mathbf{x} \) such that \( A\mathbf{x} \) is as close as possible to \( \mathbf{b} \), as measured by the square of the Euclidean norm \( \|A\mathbf{x} - \mathbf{b}\|_2^2 \).
Using the fact that \( \|\mathbf{x}\|_2^2 = \mathbf{x}^T \mathbf{x} \), we have \[ \|A\mathbf{x} - \mathbf{b}\|_2^2 = (A\mathbf{x} - \mathbf{b})^T (A\mathbf{x} - \mathbf{b}) = \mathbf{x}^T A^T A \mathbf{x} - 2\mathbf{b}^T A \mathbf{x} + \mathbf{b}^T \mathbf{b}. \] Taking the gradient with respect to \( \mathbf{x} \) and using the properties we derived in the previous section, \[ \nabla_{\mathbf{x}} (\mathbf{x}^T A^T A \mathbf{x} - 2\mathbf{b}^T A \mathbf{x} + \mathbf{b}^T \mathbf{b}) = \nabla_{\mathbf{x}} \mathbf{x}^T A^T A \mathbf{x} - \nabla_{\mathbf{x}} 2\mathbf{b}^T A \mathbf{x} + \nabla_{\mathbf{x}} \mathbf{b}^T \mathbf{b} = 2A^T A \mathbf{x} - 2A^T \mathbf{b}. \] Setting this last expression equal to zero gives the normal equations \( A^T A \mathbf{x} = A^T \mathbf{b} \); solving for \( \mathbf{x} \) yields \[ \mathbf{x} = (A^T A)^{-1} A^T \mathbf{b}, \] which is the same as what we derived in class.
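As a sanity check (a NumPy sketch with random data, chosen only for illustration), the normal-equations solution matches NumPy's built-in least-squares solver; in practice one would use such a QR/SVD-based solver rather than explicitly inverting \( A^T A \):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((8, 3))      # tall, full rank with probability 1
b = rng.standard_normal(8)           # generically not in R(A)

# Normal-equations solution derived above.
x_normal = np.linalg.inv(A.T @ A) @ A.T @ b

# NumPy's least-squares solver (SVD-based) should agree.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))            # True
# At the minimizer, the gradient 2 A^T A x - 2 A^T b vanishes.
print(np.allclose(A.T @ (A @ x_normal - b), 0))  # True
```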
4.5 Gradients of the Determinant
Now let's consider a situation where we find the gradient of a function with respect to a matrix, namely for \( A \in \mathbb{R}^{n \times n} \), we want to find \( \nabla_A |A| \). Recall the cofactor (Laplace) expansion of the determinant, \[ |A| = \sum_{i=1}^n (-1)^{i+j} A_{ij} |A_{\setminus i, \setminus j}| \quad (\text{for any } j \in 1, \ldots, n), \] where \( A_{\setminus i, \setminus j} \) denotes the matrix obtained from \( A \) by deleting the \( i \)-th row and \( j \)-th column. Expanding along the \( \ell \)-th column, only the \( i = k \) term depends on \( A_{k\ell} \) (the minors \( |A_{\setminus i, \setminus \ell}| \) do not involve column \( \ell \)), so \[ \frac{\partial |A|}{\partial A_{k\ell}} = \frac{\partial}{\partial A_{k\ell}} \sum_{i=1}^n (-1)^{i+\ell} A_{i\ell} |A_{\setminus i, \setminus \ell}| = (-1)^{k+\ell} |A_{\setminus k, \setminus \ell}| = (\text{adj}(A))_{\ell k}. \] From this, it immediately follows from the properties of the adjoint that \[ \nabla_A |A| = (\text{adj}(A))^T = |A| A^{-T}. \]
Now let's consider the function \( f : S^n_{++} \to \mathbb{R}, f(A) = \log |A| \). Note that we have to restrict the domain of \( f \) to be the positive definite matrices, since this ensures that \( |A| > 0 \), so that the log of \( |A| \) is a real number. In this case, we can use the chain rule (nothing fancy, just the ordinary chain rule from single-variable calculus) to see that \[ \frac{\partial \log |A|}{\partial A_{ij}} = \frac{\partial \log |A|}{\partial |A|} \frac{\partial |A|}{\partial A_{ij}} = \frac{1}{|A|} \frac{\partial |A|}{\partial A_{ij}}. \] From this, it should be obvious that \[ \nabla_A \log |A| = \frac{1}{|A|} \nabla_A |A| = A^{-1}, \] where we can drop the transpose in the last expression because \( A \) is symmetric. Note the similarity to the single-variable case, where \( \frac{\partial}{\partial x} \log x = \frac{1}{x} \).
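A finite-difference check of this identity (a NumPy sketch; the positive definite \( A \) is constructed arbitrarily, and the perturbation treats each entry of \( A \) as an independent variable, as the derivation above does):

```python
import numpy as np

rng = np.random.default_rng(7)
B = rng.standard_normal((3, 3))
A = B @ B.T + 3 * np.eye(3)          # symmetric positive definite, so |A| > 0

def logdet(M):
    # slogdet computes log|M| stably; the sign is +1 here since M is PD.
    sign, val = np.linalg.slogdet(M)
    return val

# Central-difference estimate of d log|A| / d A_ij, entry by entry.
eps = 1e-6
grad = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(A); E[i, j] = eps   # perturb a single entry
        grad[i, j] = (logdet(A + E) - logdet(A - E)) / (2 * eps)

print(np.allclose(grad, np.linalg.inv(A)))    # True: gradient matches A^{-1}
```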
4.6 Eigenvalues as Optimization
Finally, we use matrix calculus to solve an optimization problem in a way that leads directly to eigenvalue/eigenvector analysis. Consider the following equality-constrained optimization problem: \[ \max_{\mathbf{x} \in \mathbb{R}^n} \mathbf{x}^T A \mathbf{x} \quad \text{subject to} \quad \|\mathbf{x}\|_2^2 = 1, \] for a symmetric matrix \( A \in S^n \). A standard way of solving optimization problems with equality constraints is by forming the Lagrangian, an objective function that includes the equality constraints. The Lagrangian in this case can be given by \[ L(\mathbf{x}, \lambda) = \mathbf{x}^T A \mathbf{x} - \lambda \mathbf{x}^T \mathbf{x}, \] where \( \lambda \) is called the Lagrange multiplier associated with the equality constraint. It can be established that for \( \mathbf{x}^* \) to be an optimal point of the problem, the gradient of the Lagrangian has to be zero at \( \mathbf{x}^* \) (this is not the only condition, but it is required). That is, \[ \nabla_{\mathbf{x}} L(\mathbf{x}, \lambda) = \nabla_{\mathbf{x}} (\mathbf{x}^T A \mathbf{x} - \lambda \mathbf{x}^T \mathbf{x}) = 2A \mathbf{x} - 2\lambda \mathbf{x} = 0. \] Notice that this is just the linear equation \[ A \mathbf{x} = \lambda \mathbf{x}. \] This shows that the only points which can possibly maximize (or minimize) \( \mathbf{x}^T A \mathbf{x} \) subject to \( \mathbf{x}^T \mathbf{x} = 1 \) are the eigenvectors of \( A \).
5. Linear Algebra and Matrix Calculus Cheatsheet
1. Matrix Operations
- Inverse: For \( A \in \mathbb{R}^{n \times n} \), \( A^{-1}A = I = AA^{-1} \). Exists if and only if \( A \) is full rank.
- Orthogonal Matrix: \( U^T U = I = U U^T \). Inverse is its transpose: \( U^{-1} = U^T \).
- Determinant: \( |A| \) measures the "volume" spanned by rows/columns of \( A \). Properties:
- \( |AB| = |A||B| \)
- \( |A^{-1}| = 1/|A| \)
- \( |A| = 0 \iff A \) is singular.
- Range (Columnspace): \( R(A) = \{\mathbf{v} \in \mathbb{R}^m : \mathbf{v} = A\mathbf{x}, \mathbf{x} \in \mathbb{R}^n\} \).
- Nullspace: \( N(A) = \{\mathbf{x} \in \mathbb{R}^n : A\mathbf{x} = \mathbf{0}\} \).
2. Quadratic Forms and Definiteness
- Quadratic Form: \( \mathbf{x}^T A \mathbf{x} = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j \).
- Positive Definite (PD): \( \mathbf{x}^T A \mathbf{x} > 0 \) for all \( \mathbf{x} \neq \mathbf{0} \).
- Positive Semidefinite (PSD): \( \mathbf{x}^T A \mathbf{x} \geq 0 \) for all \( \mathbf{x} \).
- Negative Definite (ND): \( \mathbf{x}^T A \mathbf{x} < 0 \) for all \( \mathbf{x} \neq \mathbf{0} \).
- Negative Semidefinite (NSD): \( \mathbf{x}^T A \mathbf{x} \leq 0 \) for all \( \mathbf{x} \).
- Indefinite: \( \mathbf{x}_1^T A \mathbf{x}_1 > 0 \) and \( \mathbf{x}_2^T A \mathbf{x}_2 < 0 \) for some \( \mathbf{x}_1, \mathbf{x}_2 \).
3. Eigenvalues and Eigenvectors
- Eigenvalue-Eigenvector Equation: \( A\mathbf{x} = \lambda\mathbf{x}, \mathbf{x} \neq \mathbf{0} \).
- Diagonalization: If \( A \) has \( n \) linearly independent eigenvectors, \( A = X \Lambda X^{-1} \).
- Symmetric Matrices:
- All eigenvalues are real.
- Eigenvectors are orthonormal: \( A = U \Lambda U^T \).
- Definiteness via Eigenvalues:
- PD: All \( \lambda_i > 0 \).
- PSD: All \( \lambda_i \geq 0 \).
- ND: All \( \lambda_i < 0 \).
- NSD: All \( \lambda_i \leq 0 \).
- Indefinite: Mixed signs of \( \lambda_i \).
4. Matrix Calculus
- Gradient: \( \nabla_{\mathbf{x}} f(\mathbf{x}) \in \mathbb{R}^n \): \[ \nabla_{\mathbf{x}} f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f(\mathbf{x})}{\partial x_1} \\ \frac{\partial f(\mathbf{x})}{\partial x_2} \\ \vdots \\ \frac{\partial f(\mathbf{x})}{\partial x_n} \end{bmatrix}. \]
- Hessian: \( \nabla^2_{\mathbf{x}} f(\mathbf{x}) \in \mathbb{R}^{n \times n} \): \[ (\nabla^2_{\mathbf{x}} f(\mathbf{x}))_{ij} = \frac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_j}. \] Always symmetric if \( f \) is twice differentiable.
- Gradients of Common Functions:
- \( \nabla_{\mathbf{x}} \mathbf{b}^T \mathbf{x} = \mathbf{b} \).
- \( \nabla_{\mathbf{x}} \mathbf{x}^T A \mathbf{x} = 2A\mathbf{x} \) (if \( A \) symmetric).
- \( \nabla^2_{\mathbf{x}} \mathbf{x}^T A \mathbf{x} = 2A \) (if \( A \) symmetric).
- Gradients of Determinants:
- \( \nabla_A |A| = |A| A^{-T} \).
- \( \nabla_A \log |A| = A^{-1} \) (if \( A \) symmetric).
5. Applications
- Least Squares: Solve \( \min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 \). Solution: \[ \mathbf{x} = (A^T A)^{-1} A^T \mathbf{b}. \]
- Eigenvalues as Optimization: Maximize \( \mathbf{x}^T A \mathbf{x} \) subject to \( \|\mathbf{x}\|_2^2 = 1 \). Optimal \( \mathbf{x} \) is the eigenvector corresponding to the largest eigenvalue of \( A \).