Linear Algebra for Deep Learning - Cookbook


Since linear algebra is a subfield of continuous mathematics, most computer scientists do not have enough experience with linear algebra, a tool vital for studying deep learning. In this chapter, we will introduce the essential knowledge of linear algebra as a prerequisite to deep learning models and algorithms in later chapters.

2.1 Scalars, Vectors, Matricies and Tensors

Scalar: A single number.
Vector: An array of numbers orderly arranged. We can identify each individual number by indexing. We denote vectors by bold letters like . If a vector has elements each , we denote .
\begin{bmatrix} x_1 \cr x_2 \cr … \cr x_{n-1} \cr x_n \end{bmatrix}
If we want to access multiple elements of a vector, we can define a set and access their respective elements by .

Matricies: A matrix is a 2-D array of numbers, so each element is identified by two indicies instead of by just one. We denote it as .

A_{1,1} & A_{1,2} & A_{1,3} \cr
A_{2,1} & A_{2,2} & A_{2,3}

Tensors: In general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor. That is, all scalars, vectors, and matricies are tensors.

One Important operation on matricies is the transpose. We denote transpose of a matrix as , and it is defined such that

2.2 Multiplying Matrices and Vectors

Matrix multiplication: If is of shape and is of shape , we define their matrix product of size as follows: $$ \mathbf{C} = \mathbf{A}\mathbf{B} $$ This operation is defined by $$C_{i,j} = \sum_{k}A_{i,k}B_{k,j}.$$

Note that this operation is not equivalent to multiplying element-wise. The element-wise product is defined as the Hadamard product, denoted as

Matrix multiplication is associative: $$\textbf{A}(\textbf{B}+\textbf{C}) = \textbf{A}\textbf{B} + \textbf{A}\textbf{C}.$$

It is also associative: $$\textbf{A}(\textbf{B}\textbf{C}) = (\textbf{A}\textbf{B})\textbf{C}.$$

However, matrix multiplication is not commutative. That is, does not always hold.

The transpose of matrix product has a simple form: $$(\textbf{A}\textbf{B})^\top = \textbf{B}^\top \textbf{A}^\top .$$

2.3 Identity and Inverse Matricies

To formally introduce matrix inversion, we need to define the concept of identity matrix. We define identity matrix as a matrix that preserves the -dimensional vector after matrix multiplication. We denote such identity matrix as . Formally, and $$\forall \mathbf{x} \in \mathbb{R}^{n}, I_{n} \mathbf{x} = \mathbf{x}.$$

From this, we define the inverse of a matrix , denoted by , as the matrix such that $$\mathbf{A}^{-1} \mathbf{A} = \mathbf{I}_{n}.$$

We now can solve the inequality in the following steps:

Of course, this solution depends on it being possible to find . In the following sections, we will discuss the sufficient conditions for to exist.

2.4 Linear Dependence and Span

In order for to exist, must have exactly one solution. If both and are solutions, then

is also a solution for any . That is, any linear combination of and is also a solution.

Geometrically, we can think of the columns of as the different directions we can travel from the origin (the directions are defined by the column vectors of ). In this view, the solution specifies how many units we need to travel in each of these direction to reach the vector : $$\textbf{A}\textbf{x} = \sum_{i}x_{i}\textbf{A}_{:,i}.$$

In general, this kind of operation is called a linear combination. Formally, a linear combination of the set of vectors is defined as taking the summation of each vector with a scalar : $$\sum_{i}c_{i}\mathbf{v}^{(i)}.$$

The span of a set of vectors is the set of all points obtainable by linear combination of the origional vectors.

Thus, determining whether has a solution amounts to determining whether is in the span of the columns in . We call this particular span as the column space or the range of .

In order for the system to have a solution for any , we therefore require column space of be all of . If any point is excluded from , that point count potentially be a value for that has no solution. This immediately implies has at least m columns.

We call a set of vectors linear dependent if one vector is in the span of other vectors in the same set. Thereby, we can establish a necessary and sufficient condition for to exist (ie. unique ): $$\mathbf{A} \text{ contains exactly } m \textbf{ linearly independent} \text{ columns}$$

A square matrix with linearly dependent columns is known as singular.

2.5 Norms

We use norms to measure the size of a vector in machine learning. Fosrmally, the norm is given by


norms are function mappings that maps any vector to a scalar. More rigorously, a norm is defined as any function that satisfies the following properties:

  1. (the triangle inequality)
  • Note that norms are not linear in

In particular, the norm is known as the Euclidean Norm. It measures the Euclidean distance from the origin to any vector. Since this norm is often used in machine learning, it can be notated as (omitting the subscript)

In reality, the squared norm , which can be easily calculated by , is a lot simpler to work with. For example, the derivatives of with respect to each element is dependent on the corresponding element of , while all of the derivatives of the norm depends on the entire vector.

However, the squared norm is undesirable sometimes because it changes very slowly near the origin. For machine learning algorithm that discriminates between zero and non-zero elements, this behavior undermines the model. In such cases, we resort to the norm, defined as $$\lVert \mathbf{x} \rVert_{1} = \sum_{i}\lvert x_{i}\rvert.$$ which converges linearly around the origin.

norm - : Incorrectly used by some authors to denote the number of non-zero parameters in . It is incorrect because this norm fails the third condition in the definition of norms.

norm - : Also known as the max norm This norm is the absolute value of the largest element in ,

In deep learning, we sometimes wish to measure the size of a matrix. We use the Frobenius norm , which is analogous to the norm, to achieve this: $$\lVert A \rVert_{F} = \sqrt{\sum_{i,j}A^{2}_{i,j}}$$

2.6 Special Kinds of Matricies and Vectors

Awaiting Update