0%

### Introduction

Since linear algebra is a subfield of continuous mathematics, most computer scientists do not have enough experience with linear algebra, a tool vital for studying deep learning. In this chapter, we will introduce the essential knowledge of linear algebra as a prerequisite to deep learning models and algorithms in later chapters. ### 2.1 Scalars, Vectors, Matricies and Tensors

Scalar: A single number.
Vector: An array of numbers orderly arranged. We can identify each individual number by indexing. We denote vectors by bold letters like $\mathbf{x_{i}}$. If a vector $\mathbf{x}$ has $n$ elements each $\in \mathbb{R}$, we denote $\mathbf{x} \in \mathbb{R}^n$ .
\begin{bmatrix} x_1 \cr x_2 \cr … \cr x_{n-1} \cr x_n \end{bmatrix}
If we want to access multiple elements of a vector, we can define a set $S = \\{ 2,3,5 \\}$ and access their respective elements by $x_{S}$ .

Matricies: A matrix is a 2-D array of numbers, so each element is identified by two indicies instead of by just one. We denote it as $\mathbf{A}_{i,j}$ .

\begin{bmatrix}
A_{1,1} & A_{1,2} & A_{1,3} \cr
A_{2,1} & A_{2,2} & A_{2,3}
\end{bmatrix}

Tensors: In general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor. That is, all scalars, vectors, and matricies are tensors.

One Important operation on matricies is the transpose. We denote transpose of a matrix $\mathbf{A}$ as $\mathbf{A}^\top$, and it is defined such that $(\mathbf{A})^\top_{i,j} = \mathbf{A}_{j,i}$

### 2.2 Multiplying Matrices and Vectors

Matrix multiplication: If $\mathbf{A}$ is of shape $m \times n$ and $\mathbf{B}$ is of shape $n \times p$, we define their matrix product $\mathbf{C}$ of size $m \times p$ as follows: $$\mathbf{C} = \mathbf{A}\mathbf{B}$$ This operation is defined by $$C_{i,j} = \sum_{k}A_{i,k}B_{k,j}.$$

Note that this operation is not equivalent to multiplying element-wise. The element-wise product is defined as the Hadamard product, denoted as $\textbf{A} \odot \textbf{B}$

Matrix multiplication is associative: $$\textbf{A}(\textbf{B}+\textbf{C}) = \textbf{A}\textbf{B} + \textbf{A}\textbf{C}.$$

It is also associative: $$\textbf{A}(\textbf{B}\textbf{C}) = (\textbf{A}\textbf{B})\textbf{C}.$$

However, matrix multiplication is not commutative. That is, $\textbf{A}\textbf{B} = \textbf{B}\textbf{A}$ does not always hold.

The transpose of matrix product has a simple form: $$(\textbf{A}\textbf{B})^\top = \textbf{B}^\top \textbf{A}^\top .$$

### 2.3 Identity and Inverse Matricies

To formally introduce matrix inversion, we need to define the concept of identity matrix. We define identity matrix as a matrix that preserves the $n$-dimensional vector $\mathbf{x} \in \mathbb{R}^{n}$ after matrix multiplication. We denote such identity matrix as $I_{n}$. Formally, $I_{n} \in \mathbb{R}^{n \times n}$ and $$\forall \mathbf{x} \in \mathbb{R}^{n}, I_{n} \mathbf{x} = \mathbf{x}.$$

From this, we define the inverse of a matrix $\mathbf{A}$, denoted by $\mathbf{A}^{-1}$, as the matrix such that $$\mathbf{A}^{-1} \mathbf{A} = \mathbf{I}_{n}.$$

We now can solve the inequality $\mathbf{A}\mathbf{x} = \mathbf{b}$ in the following steps:

Of course, this solution depends on it being possible to find $\mathbf{A}^{-1}$. In the following sections, we will discuss the sufficient conditions for $\mathbf{A}^{-1}$ to exist.

### 2.4 Linear Dependence and Span

In order for $\mathbf{A}^{-1}$ to exist, $\mathbf{A}\mathbf{x} = \mathbf{b}$ must have exactly one solution. If both $\mathbf{x}$ and $\mathbf{y}$ are solutions, then

is also a solution for any $\alpha \in \mathbb{R}$. That is, any linear combination of $\mathbf{x}$ and $\mathbf{y}$ is also a solution.

Geometrically, we can think of the columns of $\textbf{A}$ as the different directions we can travel from the origin (the directions are defined by the column vectors of $\textbf{A}$). In this view, the solution $\mathbf{x}$ specifies how many units we need to travel in each of these direction to reach the vector $\mathbf{b}$: $$\textbf{A}\textbf{x} = \sum_{i}x_{i}\textbf{A}_{:,i}.$$

In general, this kind of operation is called a linear combination. Formally, a linear combination of the set of vectors $\\{\textbf{v}^{(1)}, \cdots , \textbf{v}^{(n)} \\}$ is defined as taking the summation of each vector $\mathbf{v}^{(i)}$ with a scalar $c_{i}$: $$\sum_{i}c_{i}\mathbf{v}^{(i)}.$$

The span of a set of vectors is the set of all points obtainable by linear combination of the origional vectors.

Thus, determining whether $\mathbf{A}\mathbf{x} = \mathbf{b}$ has a solution amounts to determining whether $\mathbf{b}$ is in the span of the columns in $\mathbf{A}$. We call this particular span as the column space or the range of $\mathbf{A}$.

In order for the system $\mathbf{A}\mathbf{x} = \mathbf{b}$ to have a solution for any $\mathbf{b} \in \mathbb{R}^{m}$, we therefore require column space of $\mathbf{A}$ be all of $\mathbb{R}^{m}$. If any point is excluded from $\mathbb{R}^{m}$, that point count potentially be a value for $\mathbf{b}$ that has no solution. This immediately implies $\textbf{A}$ has at least m columns.

We call a set of vectors linear dependent if one vector is in the span of other vectors in the same set. Thereby, we can establish a necessary and sufficient condition for $\mathbf{A}^{-1}$ to exist (ie. $\exists$ unique $\mathbf{x} \text{ for } \forall \mathbf{b} \in \mathbb{R}^{m} \textit{ s.t } \mathbf{A}\mathbf{x} = \mathbf{b}$): $$\mathbf{A} \text{ contains exactly } m \textbf{ linearly independent} \text{ columns}$$

A square matrix with linearly dependent columns is known as singular.

### 2.5 Norms

We use norms to measure the size of a vector in machine learning. Fosrmally, the $L^p$ norm is given by

for $p \in \mathbb{R}, p \geq 1.$

$L^{p}$ norms are function mappings that maps any vector $\mathbf{x}$ to a scalar. More rigorously, a norm is defined as any function $f$ that satisfies the following properties:

1. $f(\textbf{x}) = 0 \Rightarrow \textbf{x} = \mathbf{0}$
2. $f(\mathbf{x} + \mathbf{y}) \leq f(\mathbf{x}) + f(\mathbf{y})$ (the triangle inequality)
3. $\forall \alpha \in \mathbb{R}, f(\alpha \mathbf{x}) = \lvert \alpha \rvert f(\mathbf{x})$
• Note that norms are not linear in $\mathbb{R}^{n}$

In particular, the $L^{2}$ norm is known as the Euclidean Norm. It measures the Euclidean distance from the origin to any vector. Since this norm is often used in machine learning, it can be notated as $\lVert \mathbf{x} \rVert$ (omitting the $_{2}$ subscript)

In reality, the squared $L^{2}$ norm $\lVert \mathbf{x} \rVert^{2}$, which can be easily calculated by $\mathbf{x}^\top\mathbf{x}$, is a lot simpler to work with. For example, the derivatives of $\lVert \mathbf{x} \rVert^{2}$ with respect to each element $\mathbf{x}$ is dependent on the corresponding element of $\mathbf{x}$, while all of the derivatives of the $L^{2}$ norm depends on the entire vector.

However, the squared $L^{2}$ norm is undesirable sometimes because it changes very slowly near the origin. For machine learning algorithm that discriminates between zero and non-zero elements, this behavior undermines the model. In such cases, we resort to the $L^{1}$ norm, defined as $$\lVert \mathbf{x} \rVert_{1} = \sum_{i}\lvert x_{i}\rvert.$$ which converges linearly around the origin.

$L^{0}$ norm - $\lVert \textbf{x} \rVert_{0}$ : Incorrectly used by some authors to denote the number of non-zero parameters in $\mathbf{x}$. It is incorrect because this norm fails the third condition in the definition of norms.

$L^{\inf}$ norm - $\lVert \textbf{x} \rVert_{\inf}$ : Also known as the max norm This norm is the absolute value of the largest element in $\mathbf{x}$,

In deep learning, we sometimes wish to measure the size of a matrix. We use the Frobenius norm $\lVert A \rVert_{F}$, which is analogous to the $L^{2}$ norm, to achieve this: $$\lVert A \rVert_{F} = \sqrt{\sum_{i,j}A^{2}_{i,j}}$$

Awaiting Update