Lecture 1: Floating-point arithmetic, vector norms

Syllabus

Today:

Tomorrow: Matrix norms and unitary matrices

Representation of numbers

Fixed point

Floating point

Numbers in computer memory are typically represented as floating-point numbers.

A floating point number is represented as

$$\textrm{number} = \textrm{significand} \times \textrm{base}^{\textrm{exponent}},$$

where the significand is an integer, the base is a positive integer, and the exponent is an integer (possibly negative), e.g.

$$ 1.2 = 12 \cdot 10^{-1}.$$
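As a small illustration (not part of the original notes), Python's math.frexp decomposes a float into a base-2 significand and exponent:

import math

# x = m * 2**e with 0.5 <= |m| < 1: the binary analogue of the decimal example above
m, e = math.frexp(1.2)
print(m, e, m * 2**e)   # 0.6 1 1.2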

Fixed vs Floating

Q: What are the advantages/disadvantages of the fixed and floating points?

A: Fixed-point numbers give uniform absolute precision over a limited range, while floating-point numbers trade that for a much wider dynamic range with roughly constant relative precision. In most cases, either works just fine.

IEEE 754

In modern computers, floating-point representation is governed by the IEEE 754 standard, published in 1985; before that, different computers handled floating-point numbers differently.

IEEE 754 specifies the floating-point representation (sign, significand, exponent), special values ($\pm\infty$ and NaN), rounding rules, and the behaviour of arithmetic operations.

Possible values are defined by a base $b$, a precision $p$ (the number of digits in the significand), and a maximal exponent $e_{\max}$,

and have the following restrictions: the significand is an integer in $[0, b^p - 1]$ and the exponent is an integer between $e_{\min}$ and $e_{\max}$ (see the table below).

The two most common formats: single & double

The two most common formats are binary32 and binary64 (also called the single and double formats). Recently, the binary16 format has started to play an important role in training deep neural networks.

Name      Common name       Base  Digits  Emin   Emax
binary16  half precision    2     11      -14    +15
binary32  single precision  2     24      -126   +127
binary64  double precision  2     53      -1022  +1023

Examples

Q: What about -infinity and NaN?
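A small sketch (assuming NumPy) of how these special values behave:

import numpy as np

x = np.float64(1.0)
print(x / 0.0)             # inf (with a RuntimeWarning); -x / 0.0 gives -inf
print(np.inf - np.inf)     # nan: the result of an operation with no meaningful value
print(np.nan == np.nan)    # False: NaN compares unequal even to itself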

Accuracy and memory

The relative accuracy of single precision is $10^{-7}$–$10^{-8}$, while for double precision it is $10^{-14}$–$10^{-16}$.
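These figures can be checked directly with NumPy (a minimal sketch):

import numpy as np

# machine epsilon (relative accuracy) and size in bytes of each type
for t in [np.float16, np.float32, np.float64]:
    print(t.__name__, "eps =", np.finfo(t).eps, " bytes =", np.dtype(t).itemsize)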

Crucial note 1: float16 takes 2 bytes, float32 takes 4 bytes, and float64 (double precision) takes 8 bytes.

Crucial note 2: float32 and float64 are the only floating-point types natively supported by CPU hardware; on GPUs/TPUs, additional floating-point types are supported.

Crucial note 3: You should use double precision in computational science and engineering, and single (or lower) precision on GPUs and in data science.

Also, half precision can be useful in training deep neural networks; see this paper.

How does the number representation format affect the training of neural networks (NNs)?

Plots are taken from this paper

bfloat16 (Brain Floating Point)

Tensor Float from Nvidia (blog post about this format)

Mixed precision (docs from Nvidia)

Alternative to the IEEE 754 standard

Issues in IEEE 754:

The concept of posits has been proposed as a replacement for floating-point numbers; see this paper.

Division accuracy demo
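The original demo cell is not reproduced here; a possible sketch of it (assuming NumPy) compares division in single and double precision:

import numpy as np

rng = np.random.default_rng(0)
a = rng.random(10**6).astype(np.float32)
b = rng.random(10**6).astype(np.float32)

exact = a.astype(np.float64) / b.astype(np.float64)   # double precision as reference
rel_err = np.abs(a / b - exact) / np.abs(exact)
print("max relative error of float32 division:", rel_err.max())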

Square root accuracy demo
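A sketch of what this demo may look like: repeatedly taking square roots and squaring back in single precision destroys the value completely.

import numpy as np

x = np.float32(2.0)
y = x
for _ in range(30):
    y = np.sqrt(y)      # repeated rounding drives y to exactly 1.0 in float32
for _ in range(30):
    y = y ** 2          # squaring back therefore cannot recover x
print("recovered:", y, " error:", abs(y - x))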

Exponent accuracy demo
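Similarly, a sketch comparing the exponential in single and double precision:

import numpy as np

x = 10.0
val32 = np.exp(np.float32(x))
val64 = np.exp(np.float64(x))
print("float32:", val32)
print("float64:", val64)
print("relative error:", abs(val32 - val64) / abs(val64))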

Summary of demos

Loss of significance
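A classical illustration (a sketch, not the original cell): subtracting two nearly equal numbers loses most of the significant digits.

import numpy as np

a = np.float32(1.0) + np.float32(1e-7)   # rounds to 1 + eps ≈ 1.00000012
b = np.float32(1.0)
print(a - b)                             # ~1.19e-7 instead of 1e-7: about 19% relative error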

Summation algorithm

The accumulated rounding error depends not only on the precision but also on the algorithm. Consider computing the sum

$$S = \sum_{i=1}^n x_i = x_1 + \ldots + x_n.$$

Naïve algorithm

The naïve algorithm adds the numbers one by one:

$$y_1 = x_1, \quad y_2 = y_1 + x_2, \quad y_3 = y_2 + x_3, \ldots.$$
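A quick sketch (assuming NumPy; the test data is an illustrative choice) of how the naïve algorithm loses accuracy in single precision:

import numpy as np

x = np.full(10**6, 0.1, dtype=np.float32)
s = np.float32(0.0)
for v in x:                    # naïve one-by-one accumulation in float32
    s += v
exact = 0.1 * 10**6
print("naïve float32 sum:", s, " relative error:", abs(s - exact) / exact)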

Kahan summation

The following algorithm gives a $2 \varepsilon + \mathcal{O}(n \varepsilon^2)$ error, where $\varepsilon$ is the machine precision (compare with the $\mathcal{O}(n \varepsilon)$ worst-case error of the naïve algorithm).

def kahan_sum(x):
    # s is the running sum; c accumulates the low-order bits lost in previous additions
    s = 0.0
    c = 0.0
    for i in range(len(x)):
        y = x[i] - c        # correct the next term by the previously lost part
        t = s + y           # low-order bits of y may be lost in this addition
        c = (t - s) - y     # (t - s) is what was actually added; c is what was lost
        s = t
    return s
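A usage sketch on the same kind of data as the naïve demo above:

import numpy as np

x = np.full(10**6, 0.1, dtype=np.float32)
exact = 0.1 * 10**6
s = kahan_sum(x)    # compensated summation, carried out in float32 arithmetic
# the compensated sum is accurate to roughly float32 machine precision
print("Kahan float32 sum:", s, " relative error:", abs(s - exact) / exact)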

More complicated example

Summary of floating-point

Vectors

Example: polynomials of degree $\leq n$ form a linear space. The polynomial $ x^3 - 2x^2 + 1$ can be considered as the vector $\begin{bmatrix}1 \\ -2 \\ 0 \\ 1\end{bmatrix}$ in the basis $\{x^3, x^2, x, 1\}$.
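For instance (a small sketch, not in the original), NumPy's np.polyval evaluates a polynomial given exactly this coefficient vector:

import numpy as np

c = np.array([1, -2, 0, 1])   # coefficients of x^3 - 2x^2 + 1 in the basis {x^3, x^2, x, 1}
print(np.polyval(c, 2.0))     # 2^3 - 2*2^2 + 1 = 1.0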

Vector norm

Distances and norms

The norm should satisfy certain properties: non-negativity ($\Vert x \Vert \geq 0$, and $\Vert x \Vert = 0$ iff $x = 0$), absolute homogeneity ($\Vert \alpha x \Vert = |\alpha| \Vert x \Vert$), and the triangle inequality ($\Vert x + y \Vert \leq \Vert x \Vert + \Vert y \Vert$).

The distance between two vectors is then defined as

$$ d(x, y) = \Vert x - y \Vert. $$

Standard norms

The most well-known and widely used norm is the Euclidean norm:

$$\Vert x \Vert_2 = \sqrt{\sum_{i=1}^n |x_i|^2},$$

which corresponds to the usual notion of distance in real life. If the vector has complex entries, $|x_i|$ denotes their modulus.

$p$-norm

The Euclidean norm, or $2$-norm, is a special case of an important class of $p$-norms:

$$ \Vert x \Vert_p = \Big(\sum_{i=1}^n |x_i|^p\Big)^{1/p}. $$

There are two very important special cases: the infinity norm (or Chebyshev norm)

$$ \Vert x \Vert_{\infty} = \max_i | x_i| $$

and the $1$-norm (or Manhattan norm)

$$ \Vert x \Vert_1 = \sum_i |x_i|. $$

We will give examples where the $L_1$ norm is very important: it is central to the compressed sensing methods that emerged in the mid-2000s as one of the most popular research topics.

Equivalence of the norms

All norms are equivalent in the sense that

$$ C_1 \Vert x \Vert_* \leq \Vert x \Vert_{**} \leq C_2 \Vert x \Vert_* $$

for some positive constants $C_1(n), C_2(n)$ and all $x \in \mathbb{R}^n$, for any pair of norms $\Vert \cdot \Vert_*$ and $\Vert \cdot \Vert_{**}$. For example, $\Vert x \Vert_\infty \leq \Vert x \Vert_2 \leq \sqrt{n}\, \Vert x \Vert_\infty$. The equivalence of norms basically means that if a vector is small in one norm, it is small in any other norm; however, the constants can be large.
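A quick numerical sanity check (a sketch, assuming NumPy) of the inequality $\Vert x \Vert_\infty \leq \Vert x \Vert_2 \leq \sqrt{n}\,\Vert x \Vert_\infty$ on a random vector:

import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.standard_normal(n)
norm_inf = np.max(np.abs(x))
norm_2 = np.sqrt(np.sum(x ** 2))
print(norm_inf <= norm_2 <= np.sqrt(n) * norm_inf)   # True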

Computing norms in Python

The NumPy package has all you need for computing norms: the np.linalg.norm function.
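A minimal usage example:

import numpy as np

x = np.array([1.0, -2.0, 0.0, 1.0])
print(np.linalg.norm(x))           # 2-norm (the default)
print(np.linalg.norm(x, 1))        # 1-norm
print(np.linalg.norm(x, np.inf))   # infinity norm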

Unit disks in different norms
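The original plot is not reproduced here; a sketch that would draw the unit balls $\{x : \Vert x \Vert_p = 1\}$ in $\mathbb{R}^2$ (assuming Matplotlib is available):

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 2 * np.pi, 400)
d = np.vstack([np.cos(t), np.sin(t)])             # directions on the unit circle
for p in [1, 2, np.inf]:
    r = 1.0 / np.linalg.norm(d, ord=p, axis=0)    # scale each direction to unit p-norm
    plt.plot(r * d[0], r * d[1], label=f"p = {p}")
plt.axis("equal")
plt.legend()
plt.show()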

Why can the $L_1$-norm be important?

The $L_1$ norm, as was discovered relatively recently, plays an important role in compressed sensing.

The simplest formulation of the problem is as follows: we observe $f = Ax$, where $A$ is a known matrix with fewer rows than columns, so the linear system $Ax = f$ is underdetermined.

The question: can we find the solution?

The solution is obviously non-unique, so a natural approach is to find the solution that is minimal in a certain sense:

\begin{align*} & \Vert x \Vert \rightarrow \min_x \\ \mbox{subject to } & Ax = f \end{align*}
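A sketch of why the choice of norm matters here (the matrix, sizes, and the LP reformulation below are illustrative assumptions, using scipy.optimize.linprog): the minimum $2$-norm solution is dense, while the minimum $1$-norm solution recovers the sparse vector.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 20, 40
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[[3, 17, 25]] = [1.0, -2.0, 0.5]          # a sparse signal
f = A @ x_true

x_l2 = np.linalg.pinv(A) @ f                    # minimum 2-norm solution: generally dense

# minimum 1-norm solution via the standard LP reformulation:
#   minimize sum(t)  subject to  -t <= x <= t,  A x = f
I = np.eye(n)
c = np.concatenate([np.zeros(n), np.ones(n)])
A_ub = np.block([[I, -I], [-I, -I]])
b_ub = np.zeros(2 * n)
A_eq = np.hstack([A, np.zeros((m, n))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=f,
              bounds=[(None, None)] * n + [(0, None)] * n)
x_l1 = res.x[:n]

print("nonzeros: 2-norm solution:", np.sum(np.abs(x_l2) > 1e-6),
      " 1-norm solution:", np.sum(np.abs(x_l1) > 1e-6))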

What is a stable algorithm?

We finish the lecture with the concept of stability.

Suppose we want to compute a function $f(x)$. A numerical algorithm $\text{alg}(x)$ actually computes only an approximation to $f(x)$.

The algorithm is called forward stable if $$\Vert \text{alg}(x) - f(x) \Vert \leq \varepsilon $$ for a small $\varepsilon$.

The algorithm is called backward stable, if for any $x$ there is a close vector $x + \delta x$ such that

$$\text{alg}(x) = f(x + \delta x)$$

and $\Vert \delta x \Vert$ is small.

Classical example

A classical example is the solution of linear systems of equations using Gaussian elimination, which is equivalent to computing the LU factorization (more details later).

We consider the Hilbert matrix with the elements

$$A = \{a_{ij}\}, \quad a_{ij} = \frac{1}{i + j + 1}, \quad i,j = 0, \ldots, n-1.$$

And consider a linear system

$$Ax = f.$$
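A sketch of this experiment (assuming NumPy; the size $n$ and right-hand side are arbitrary choices): the residual stays tiny while the error in $x$ can become large, because the Hilbert matrix is extremely ill-conditioned.

import numpy as np

n = 14
a = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
x_true = np.ones(n)
f = a @ x_true
x = np.linalg.solve(a, f)

print("residual ||A x - f||   :", np.linalg.norm(a @ x - f))
print("error    ||x - x_true||:", np.linalg.norm(x - x_true))
print("condition number of A  :", np.linalg.cond(a))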

We will look into matrices in more detail in the next lecture, and into linear systems in the upcoming weeks.

More examples of instability

How can we compute the following functions in a numerically stable manner?

Take home message