Understanding Floating-Point Representation and Errors


t: precision – a positive integer

β: base (or radix) – an integer ≥ 2 (2, 10, 16)

e: exponent – an integer

(decimal) value: d1.d2d3···dt × β^e = (d1 + d2/β + ··· + dt/β^(t−1)) × β^e

exponent range emin ≤ e ≤ emax

Number of normalized numbers: 1 + 2(β − 1)β^(t−1)(e_max − e_min + 1) (the 1 counts zero)

Including denormalized numbers: 1 + 2β^t(e_max − e_min + 1)

Stored in 3 fields: sign (1 bit: positive/negative), exponent (width depends on the range), fraction or significand (width depends on the precision)

Machine epsilon: 1 + εM > 1, εM = β^(1−t) (also the distance between 2 consecutive numbers between 1 and β)

To compute εM: keep halving until 1 + ε rounds to 1, then multiply by 2 at the end
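The halving procedure, sketched in Python (assumes IEEE double, where the result is 2^−52):

```python
# Halve eps until 1 + eps rounds to exactly 1, then undo the last halving.
eps = 1.0
while 1.0 + eps > 1.0:
    eps /= 2.0
eps *= 2.0  # overshot by one halving; multiply by 2 at the end

print(eps)               # 2.220446049250313e-16
print(eps == 2.0**-52)   # True
```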

Example (single precision): fl(0.1) = 1.10011001100110011001101 × 2^−4 ≈ 0.1 (0.1 is not exactly representable in base 2)

Absolute error bound: |fl(x) − x| ≤ (1/2)β^(1−t) × β^e. Interpret: the gap between d1.d2…dt and d1.d2…(dt + 1) is β^(1−t), one unit in the last place (ulp), so the bound is half an ulp (times β^e)

Relative error bound: |fl(x) − x| / |x| ≤ (1/2)β^(1−t), called the unit roundoff, u

IEEE formats (single / double): base 2; precision t = 24 / 53; e_min = −126 / −1022; e_max = 127 / 1023; exponent width 8 / 11 bits; format width 32 / 64 bits

Since the base is 2 and numbers are normalized, the first significand bit is always 1, so it is not stored (the hidden bit)

Exponent bias = e_max (127 for single, 1023 for double)

Exponent field values e_max + 1 (all 1s; eight 1s in single) and e_min − 1 (all 0s) are reserved for special values

Zero is signed (+0 and −0), but since IEEE defines +0 == −0, a test like if (x == 0) behaves predictably regardless of the sign

Infinities let computation continue through overflow: c/0 = ±∞ for nonzero c; e.g., x/(1 + x^2) still evaluates sensibly when x^2 overflows

NaN lets computation continue when a math error occurs (0/0, ∞ − ∞, sqrt(−1)); special numbers always have an all-1s or all-0s exponent field (eight 1s or 0s in single), and NaN is an all-1s exponent followed by a nonzero fraction

Denormalized (subnormal) numbers: all-0s exponent field, nonzero fraction, no hidden bit when represented; they allow gradual underflow
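The encodings above can be inspected directly; a Python sketch for double precision (so 11 exponent bits with bias 1023 rather than single's 8, and 52 stored fraction bits):

```python
import struct

def fields(x):
    # Repack the double as a 64-bit integer and slice out the 3 fields.
    bits = int.from_bytes(struct.pack('>d', x), 'big')
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF     # biased; 0 and 2047 are reserved
    fraction = bits & ((1 << 52) - 1)   # hidden leading 1 is not stored
    return sign, exponent, fraction

print(fields(1.0))           # (0, 1023, 0): unbiased exponent 0, significand 1.0
print(fields(float('inf')))  # (0, 2047, 0): all-1s exponent, zero fraction
print(fields(5e-324))        # (0, 0, 1): smallest subnormal, all-0s exponent
```

NaN shows the same all-1s exponent (2047) but with a nonzero fraction.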

Underflow, number from operation is too small to be represented, defaults to 0

Can be avoided by scaling

Example (magnitude sqrt(x^2 + y^2)): S = max(|x|, |y|), divide both by S so squaring cannot overflow (the smaller term may round to 0 harmlessly when squared), compute the magnitude, multiply by S
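A sketch of the scaling trick in Python (the helper name `magnitude` is mine; the standard library's `math.hypot` does the same job):

```python
import math

def magnitude(x, y):
    # Scale by S = max(|x|, |y|) so the squares stay near 1,
    # then multiply the result back by S.
    s = max(abs(x), abs(y))
    if s == 0.0:
        return 0.0
    return s * math.sqrt((x / s) ** 2 + (y / s) ** 2)

# Naive x*x + y*y overflows to inf here; the scaled version does not.
print(magnitude(3e200, 4e200))  # ~5e200
```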

Rounding-error model: fl(a + b) = (a + b)(1 + ε), |ε| ≤ u

In double precision, fl(0.1) is slightly greater than 0.1

Yet adding 0.1 ten times gives a result < 1 because of accumulated rounding error
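A quick Python check of both claims (double precision):

```python
# fl(0.1) stores slightly more than 0.1, yet summing it ten times
# rounds downward along the way and lands slightly below 1.
print(f"{0.1:.20f}")   # 0.10000000000000000555
s = 0.0
for _ in range(10):
    s += 0.1
print(s, s < 1.0)      # 0.9999999999999999 True
```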

Using a finite sum to approximate an infinite series results in truncation error; e.g., for the Taylor series of e^x cut off after the x^n/n! term, the first omitted term is x^(n+1)/(n+1)!
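A sketch of the truncation error for the e^x series (the helper name `exp_taylor` is mine):

```python
import math

def exp_taylor(x, n):
    # Partial sum of the Taylor series for e**x: sum_{k=0..n} x**k / k!
    return sum(x**k / math.factorial(k) for k in range(n + 1))

x, n = 1.0, 8
trunc = math.exp(x) - exp_taylor(x, n)            # actual truncation error
first_omitted = x**(n + 1) / math.factorial(n + 1)
print(trunc, first_omitted)  # same order of magnitude
```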

A continuous problem approximated by a discrete one introduces discretization error

e.g., for the forward difference f'(x) ≈ (f(x + h) − f(x))/h, the discretization error is (h/2)·f''(ξ)

Total error can be a combination of both (truncation/discretization plus rounding)
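A Python sketch of the trade-off for the forward difference (f = sin at x = 1 is my choice of example):

```python
import math

# Discretization error ~ (h/2)|f''| shrinks with h; rounding error ~ u/h
# grows as h shrinks, so total error is smallest at an intermediate h.
f, x = math.sin, 1.0
exact = math.cos(x)                      # true derivative
for h in (1e-1, 1e-5, 1e-8, 1e-12):
    approx = (f(x + h) - f(x)) / h
    print(h, abs(approx - exact))        # error falls, then rises again
```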

Backward error: because of rounding, the computed result is the exact result for a slightly different input

Sensitivity test: add a small number to each input, recompute the result, and find the absolute error; if it is less than the largest perturbation, the problem is insensitive to the perturbation

Cancellation in the subtraction of 2 nearly equal numbers is itself alright: the subtraction introduces no new rounding error; the danger is that it magnifies errors already present in the operands

The final answer can be more accurate than the intermediate quantities, since errors can cancel

Catastrophic cancellation results when the operands are themselves subject to rounding errors, e.g. b^2 − 4ac in the quadratic formula

Benign cancellation is when subtracting exactly known quantities

For the quadratic formula, multiply a root by its conjugate −b ∓ sqrt(b^2 − 4ac) to get the equivalent form 2c/(−b ∓ sqrt(b^2 − 4ac)); mix and match the two forms depending on whether b > 0 or b < 0 so no cancellation occurs
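A hedged sketch of the mix-and-match rewrite (the helper name `quadratic_roots` is mine; assumes a ≠ 0, c ≠ 0, and real roots):

```python
import math

def quadratic_roots(a, b, c):
    # Compute the root where -b and the square root have the same sign
    # (no cancellation), then recover the other from r1 * r2 = c/a.
    d = math.sqrt(b * b - 4 * a * c)   # assumes b*b - 4*a*c >= 0
    if b >= 0:
        r1 = (-b - d) / (2 * a)
    else:
        r1 = (-b + d) / (2 * a)
    r2 = c / (a * r1)
    return r1, r2

# x^2 - 1e8 x + 1 has roots ~1e8 and ~1e-8; the naive formula loses
# the small root to cancellation, the rewritten one does not.
print(quadratic_roots(1.0, -1e8, 1.0))
```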

Singular matrix — no inverse, determinant=0

Cramer's rule: x_i = det(A_i)/det(A), where A_i is A with column i replaced by b; not practical for real-world applications

Computing A inverse and multiplying it by b is unnecessary, inefficient, and can be inaccurate

Instead, row reduce (LU factorization) and solve the resulting triangular systems

LU factorization, computing unit lower triangular L and upper triangular U; on return A is overwritten by L and U:

    for k = 1 to n-1
        A(k+1:n,k) = A(k+1:n,k)/A(k,k);
        A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k)*A(k,k+1:n);
    end
    L = eye(n,n) + tril(A,-1)
    U = triu(A)

Solving the lower triangular system L y = b (forward substitution), overwriting b:

    b(1) = b(1)/L(1,1);
    for k = 2 to n
        tmp = b(k) - L(k,1:k-1)*b(1:k-1);
        b(k) = tmp/L(k,k);
    end
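A runnable Python translation of the two loops above, a sketch using plain lists of lists (no pivoting, as in the notes, so it assumes nonzero pivots):

```python
def lu_inplace(A):
    # Overwrite A: multipliers go strictly below the diagonal (L, with
    # unit diagonal implied), the upper triangle becomes U.
    n = len(A)
    for k in range(n - 1):
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                    # multiplier l_ik
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]      # update trailing block
    return A

def forward_solve(L, b):
    # Forward substitution: solve L y = b for lower-triangular L.
    n = len(b)
    y = list(b)
    y[0] /= L[0][0]
    for k in range(1, n):
        tmp = y[k] - sum(L[k][j] * y[j] for j in range(k))
        y[k] = tmp / L[k][k]
    return y

A = [[2.0, 1.0], [4.0, 5.0]]
lu_inplace(A)
print(A)  # [[2.0, 1.0], [2.0, 3.0]]: L = [[1,0],[2,1]], U = [[2,1],[0,3]]
print(forward_solve([[1.0, 0.0], [2.0, 1.0]], [2.0, 7.0]))  # [2.0, 3.0]
```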