Handbook of Floating-Point Arithmetic

Floating-point arithmetic is the most widely used way of implementing real-number arithmetic on modern computers. However, making such an arithmetic reliable and portable, yet fast, is a very difficult task. As a result, floating-point arithmetic is far from being exploited to its full potential. This handbook aims to provide a complete overview of modern floating-point arithmetic. Whenever possible, the techniques presented are illustrated by a corresponding program, so that they can be put directly into practice in actual coding or design.

The handbook is designed for programmers of numerical applications, compiler designers, programmers of floating-point algorithms, designers of arithmetic operators, and more generally, students and researchers in numerical analysis who wish to better understand a tool used in their daily work and research.

Author(s): Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres
Edition: 2
Publisher: Birkhäuser
Year: 2018

Language: English
Pages: 652
City: Cham
Tags: Floating-Point Arithmetic; IEEE 754-2008 Standard; Fused Multiply-Add Instruction; Floating-Point Operators; Elementary Functions; Fast2Sum and 2Sum Algorithms; Algorithm Analysis; Problem Complexity

Contents
List of Figures
List of Tables
Preface
I Introduction, Basic Definitions, and Standards
1 Introduction
1.1 Some History
1.2 Desirable Properties
1.3 Some Strange Behaviors
1.3.1 Some famous bugs
1.3.2 Difficult problems
1.3.2.1 A sequence that seems to converge to a wrong limit
1.3.2.2 The Chaotic Bank Society
1.3.2.3 Rump's example
2 Definitions and Basic Notions
2.1 Floating-Point Numbers
2.1.1 Main definitions
2.1.2 Normalized representations, normal and subnormal numbers
2.1.3 A note on underflow
2.1.4 Special floating-point data
2.2 Rounding
2.2.1 Rounding functions
2.2.2 Useful properties
2.3 Tools for Manipulating Floating-Point Errors
2.3.1 Relative error due to rounding
2.3.2 The ulp function
2.3.3 Link between errors in ulps and relative errors
2.3.3.1 Converting from errors in ulps to relative errors
2.3.3.2 Converting from relative errors to errors in ulps
2.3.3.3 Loss of information during these conversions
2.3.4 An example: iterated products
2.4 The Fused Multiply-Add (FMA) Instruction
2.5 Exceptions
2.6 Lost and Preserved Properties of Real Arithmetic
2.7 Note on the Choice of the Radix
2.7.1 Representation errors
2.7.2 A case for radix 10
2.8 Reproducibility
3 Floating-Point Formats and Environment
3.1 The IEEE 754-2008 Standard
3.1.1 Formats
3.1.1.1 Binary interchange format encodings
3.1.1.2 Decimal interchange format encodings
3.1.1.3 Larger formats
3.1.1.4 Extended and extendable precisions
3.1.1.5 Little-endian, big-endian
3.1.2 Attributes and rounding
3.1.2.1 Rounding direction attributes
3.1.2.2 Alternate exception-handling attributes
3.1.2.3 Preferred width attributes
3.1.2.4 Value-changing optimization attributes
3.1.2.5 Reproducibility attributes
3.1.3 Operations specified by the standard
3.1.3.1 Arithmetic operations and square root
3.1.3.2 Remainders
3.1.3.3 Preferred exponent for arithmetic operations in the decimal format
3.1.3.4 scaleB and logB
3.1.3.5 Miscellaneous
3.1.4 Comparisons
3.1.5 Conversions to/from string representations
3.1.6 Default exception handling
3.1.6.1 Invalid operation
3.1.6.2 Division by zero
3.1.6.3 Overflow
3.1.6.4 Underflow
3.1.6.5 Inexact
3.1.7 Special values
3.1.7.1 NaN: Not a Number
3.1.7.2 Arithmetic of infinities and zeros
3.1.8 Recommended functions
3.2 On the Possible Hidden Use of a Higher Internal Precision
3.3 Revision of the IEEE 754-2008 Standard
3.4 Floating-Point Hardware in Current Processors
3.4.1 The common hardware denominator
3.4.2 Fused multiply-add
3.4.3 Extended precision and 128-bit formats
3.4.4 Rounding and precision control
3.4.5 SIMD instructions
3.4.6 Binary16 (half-precision) support
3.4.7 Decimal arithmetic
3.4.8 The legacy x87 processor
3.5 Floating-Point Hardware in Recent Graphics Processing Units
3.6 IEEE Support in Programming Languages
3.7 Checking the Environment
3.7.1 MACHAR
3.7.2 Paranoia
3.7.3 UCBTest
3.7.4 TestFloat
3.7.5 Miscellaneous
II Cleverly Using Floating-Point Arithmetic
4 Basic Properties and Algorithms
4.1 Testing the Computational Environment
4.1.1 Computing the radix
4.1.2 Computing the precision
4.2 Exact Operations
4.2.1 Exact addition
4.2.2 Exact multiplications and divisions
4.3 Accurate Computation of the Sum of Two Numbers
4.3.1 The Fast2Sum algorithm
4.3.2 The 2Sum algorithm
4.3.3 If we do not use rounding to nearest
4.4 Accurate Computation of the Product of Two Numbers
4.4.1 The 2MultFMA Algorithm
4.4.2 If no FMA instruction is available: Veltkamp splitting and Dekker product
4.4.2.1 Veltkamp splitting
4.4.2.2 Dekker's multiplication algorithm
4.5 Computation of Residuals of Division and Square Root with an FMA
4.6 Another Splitting Technique: Splitting Around a Power of 2
4.7 Newton–Raphson-Based Division with an FMA
4.7.1 Variants of the Newton–Raphson iteration
4.7.2 Using the Newton–Raphson iteration for correctly rounded division with an FMA
4.7.3 Possible double roundings in division algorithms
4.8 Newton–Raphson-Based Square Root with an FMA
4.8.1 The basic iterations
4.8.2 Using the Newton–Raphson iteration for correctly rounded square roots
4.9 Radix Conversion
4.9.1 Conditions on the formats
4.9.2 Conversion algorithms
4.9.2.1 Output conversion: from radix 2 to radix 10
4.9.2.2 Input conversion: from radix 10 to radix 2
4.10 Conversion Between Integers and Floating-Point Numbers
4.10.1 From 32-bit integers to floating-point numbers
4.10.2 From 64-bit integers to floating-point numbers
4.10.3 From floating-point numbers to integers
4.11 Multiplication by an Arbitrary-Precision Constant with an FMA
4.12 Evaluation of the Error of an FMA
5 Enhanced Floating-Point Sums, Dot Products, and Polynomial Values
5.1 Preliminaries
5.1.1 Floating-point arithmetic models
5.1.2 Notation for error analysis and classical error estimates
5.1.3 Some refined error estimates
5.1.4 Properties for deriving validated running error bounds
5.2 Computing Validated Running Error Bounds
5.3 Computing Sums More Accurately
5.3.1 Reordering the operands, and a bit more
5.3.2 Compensated sums
5.3.3 Summation algorithms that somehow imitate a fixed-point arithmetic
5.3.3.1 Rump, Ogita, and Oishi's faithful summation
5.3.3.2 Demmel, Ahrens, and Nguyen's reproducible summation
5.3.3.3 Towards accurate summation hardware?
5.3.4 On the sum of three floating-point numbers
5.4 Compensated Dot Products
5.5 Compensated Polynomial Evaluation
6 Languages and Compilers
6.1 A Play with Many Actors
6.1.1 Floating-point evaluation in programming languages
6.1.1.1 Expression evaluation order
6.1.1.2 Precision of intermediate computations
6.1.1.3 Antagonistic answers
6.1.2 Processors, compilers, and operating systems
6.1.3 Standardization processes
6.1.3.1 Standard bodies
6.1.3.2 Floating-point standards
6.1.3.3 C language standards
6.1.3.4 Implementations
6.1.4 In the hands of the programmer
6.2 Floating Point in the C Language
6.2.1 Standard C11 headers and IEEE 754-1985 support
6.2.2 Types
6.2.2.1 Infinities, NaNs, and signed zeros
6.2.3 Expression evaluation
6.2.3.1 Operators and functions
6.2.3.2 Contracted expressions
6.2.3.3 Constant expressions, initialization, and exceptions
6.2.3.4 Special values of mathematical functions
6.2.4 Code transformations
6.2.5 Enabling unsafe optimizations
6.2.5.1 Complex arithmetic in C11
6.2.5.2 Range reduction for trigonometric functions
6.2.5.3 Compiler-specific optimizations
6.2.6 Summary: a few horror stories
6.2.6.1 Printing out a variable changes its value
6.2.6.2 A possible infinite loop in a sort function
6.2.7 The CompCert C compiler
6.3 Floating-Point Arithmetic in the C++ Language
6.3.1 Semantics
6.3.2 Numeric limits
6.3.3 Overloaded functions
6.4 FORTRAN Floating Point in a Nutshell
6.4.1 Philosophy
6.4.2 IEEE 754 support in FORTRAN
6.5 Java Floating Point in a Nutshell
6.5.1 Philosophy
6.5.2 Types and classes
6.5.2.1 In the virtual machine
6.5.2.2 In the Java language
6.5.3 Infinities, NaNs, and signed zeros
6.5.4 Missing features
6.5.5 Reproducibility
6.5.6 The BigDecimal package
6.6 Conclusion
III Implementing Floating-Point Operators
7 Algorithms for the Basic Operations
7.1 Overview of Basic Operation Implementation
7.2 Implementing IEEE 754-2008 Rounding
7.2.1 Rounding a nonzero finite value with unbounded exponent range
7.2.1.1 Computing the successor in a binary interchange format
7.2.1.2 Choosing between |xp| and its successor Succ(|xp|)
7.2.2 Overflow
7.2.3 Underflow and subnormal results
7.2.4 The inexact exception
7.2.5 Rounding for actual operations
7.2.5.1 Decimal rounding using the binary encoding
7.3 Floating-Point Addition and Subtraction
7.3.1 Decimal addition
7.3.2 Decimal addition using binary encoding
7.3.3 Subnormal inputs and outputs in binary addition
7.4 Floating-Point Multiplication
7.4.1 Normal case
7.4.2 Handling subnormal numbers in binary multiplication
7.4.3 Decimal specifics
7.5 Floating-Point Fused Multiply-Add
7.5.1 Case analysis for normal inputs
7.5.1.1 Product-anchored case
7.5.1.2 Addend-anchored case
7.5.1.3 Cancellation
7.5.2 Handling subnormal inputs
7.5.3 Handling decimal cohorts
7.5.4 Overview of a binary FMA implementation
7.6 Floating-Point Division
7.6.1 Overview and special cases
7.6.2 Computing the significand quotient
7.6.3 Managing subnormal numbers
7.6.4 The inexact exception
7.6.5 Decimal specifics
7.7 Floating-Point Square Root
7.7.1 Overview and special cases
7.7.2 Computing the significand square root
7.7.3 Managing subnormal numbers
7.7.4 The inexact exception
7.7.5 Decimal specifics
7.8 Nonhomogeneous Operators
7.8.1 A software algorithm around double rounding
7.8.1.1 Problem statement
7.8.1.2 The easy cases
7.8.1.3 Implementing the format-reducing operation
7.8.2 The mixed-precision fused multiply-and-add
7.8.3 Motivation
7.8.4 Implementation issues
8 Hardware Implementation of Floating-Point Arithmetic
8.1 Introduction and Context
8.1.1 Processor internal formats
8.1.2 Hardware handling of subnormal numbers
8.1.3 Full-custom VLSI versus reconfigurable circuits (FPGAs)
8.1.4 Hardware decimal arithmetic
8.1.5 Pipelining
8.2 The Primitives and Their Cost
8.2.1 Integer adders
8.2.1.1 Carry-ripple adders
8.2.1.2 Parallel adders
8.2.1.3 Fast adders
8.2.1.4 Fast addition in FPGAs
8.2.2 Digit-by-integer multiplication in hardware
8.2.3 Using nonstandard representations of numbers
8.2.4 Binary integer multiplication
8.2.5 Decimal integer multiplication
8.2.6 Shifters
8.2.7 Leading-zero counters
8.2.7.1 Tree-based leading-zero counter
8.2.7.2 Leading-zero counting by monotonic string conversion
8.2.7.3 Combined leading-zero counting and shifting for FPGAs
8.2.8 Tables and table-based methods for fixed-point function approximation
8.2.8.1 Plain tables
8.2.8.2 Table-based methods
8.2.8.3 Conclusion
8.3 Binary Floating-Point Addition
8.3.1 Overview
8.3.2 A first dual-path architecture
8.3.3 Leading-zero anticipation
8.3.3.1 Subnormal handling in addition
8.3.4 Probing further on floating-point adders
8.4 Binary Floating-Point Multiplication
8.4.1 Basic architecture
8.4.2 FPGA implementation
8.4.3 VLSI implementation optimized for delay
8.4.4 Managing subnormals
8.5 Binary Fused Multiply-Add
8.5.1 Classic architecture
8.5.2 To probe further
8.6 Division and Square Root
8.6.1 Digit-recurrence division
8.6.2 Decimal division
8.7 Beyond the Classical Floating-Point Unit
8.7.1 More fused operators
8.7.2 Exact accumulation and dot product
8.7.3 Hardware-accelerated compensated algorithms
8.8 Floating-Point for FPGAs
8.8.1 Optimization in the context of standard operators
8.8.2 Operations with a constant operand
8.8.3 Computing large floating-point sums
8.8.3.1 Application-specific accurate accumulator
8.8.3.2 Parallel summations
8.8.4 Block floating point
8.8.5 Algebraic operators
8.8.6 Elementary and compound functions
8.9 Probing Further
9 Software Implementation of Floating-Point Arithmetic
9.1 Implementation Context
9.1.1 Standard encoding of binary floating-point data
9.1.2 Available integer operators
9.1.3 First examples
9.1.3.1 Extracting the exponent field
9.1.3.2 Computing the "is normal" bit
9.1.3.3 Computing the number of leading zeros of a significand
9.1.4 Design choices and optimizations
9.2 Binary Floating-Point Addition
9.2.1 Handling special values
9.2.1.1 Detecting that a special value must be returned
9.2.1.2 Returning special values as recommended or required by IEEE 754-2008
9.2.2 Computing the sign of the result
9.2.3 Swapping the operands and computing the alignment shift
9.2.3.1 Operand swap
9.2.3.2 Alignment shift
9.2.4 Getting the correctly rounded result
9.2.4.1 A first easy case: x = -y ≠ 0
9.2.4.2 A second easy case: both x and y are subnormal numbers
9.2.4.3 Implementation of the general case
9.3 Binary Floating-Point Multiplication
9.3.1 Handling special values
9.3.1.1 Detecting that a special value must be returned
9.3.1.2 Returning special values as recommended or required by IEEE 754-2008
9.3.2 Sign and exponent computation
9.3.2.1 Computing the nonnegative integer D-1
9.3.3 Overflow detection
9.3.3.1 Overflow before rounding
9.3.3.2 Overflow after rounding
9.3.4 Getting the correctly rounded result
9.3.4.1 Computing the normalized significands mx' and my'
9.3.4.2 Computing the product mx'my' exactly
9.3.4.3 Computing the guard and sticky bits needed for rounding correctly
9.3.4.4 Rounding the significand and packing the result
9.4 Binary Floating-Point Division
9.4.1 Handling special values
9.4.1.1 Detecting that a special value must be returned
9.4.1.2 Returning special values as recommended or required by IEEE 754-2008
9.4.2 Sign and exponent computation
9.4.2.1 Computing the nonnegative integer D-1
9.4.3 Overflow detection
9.4.4 Getting the correctly rounded result
9.4.4.1 First example: restoring division
9.4.4.2 Second example: division by polynomial evaluation
9.5 Binary Floating-Point Square Root
9.5.1 Handling special values
9.5.1.1 Detecting that a special value must be returned
9.5.1.2 Returning special values as recommended or required by IEEE 754-2008
9.5.2 Exponent computation
9.5.2.1 Formula for the exponent of the result
9.5.2.2 Implementation for the binary32 format
9.5.3 Getting the correctly rounded result
9.5.3.1 Computation of the shift value c
9.5.3.2 First example: restoring square root
9.5.3.3 Second example: square root by polynomial evaluation
9.6 Custom Operators
10 Evaluating Floating-Point Elementary Functions
10.1 Introduction
10.1.1 Which accuracy?
10.1.2 The various steps of function evaluation
10.1.2.1 Range reduction
10.1.2.2 Function approximation
10.2 Range Reduction
10.2.1 Basic range reduction algorithms
10.2.1.1 Cody and Waite's reduction algorithm
10.2.1.2 Payne and Hanek's algorithm
10.2.2 Bounding the relative error of range reduction
10.2.3 More sophisticated range reduction algorithms
10.2.4 Examples
10.2.4.1 An example of range reduction for the exponential function
10.2.4.2 An example of range reduction for the logarithm
10.3 Polynomial Approximations
10.3.1 L² case
10.3.2 L∞, or minimax, case
10.3.3 "Truncated" approximations
10.3.4 In practice: using the Sollya tool to compute constrained approximations and certified error bounds
10.4 Evaluating Polynomials
10.4.1 Evaluation strategies
10.4.2 Evaluation error
10.5 The Table Maker's Dilemma
10.5.1 When there is no need to solve the TMD
10.5.2 On breakpoints
10.5.2.1 Breakpoints of some algebraic functions
10.5.2.2 Breakpoints of transcendental functions
10.5.3 Finding the hardest-to-round points
10.5.3.1 Hardest-to-round points of transcendental functions in binary32 arithmetic
10.5.3.2 Hardest-to-round points of transcendental functions in binary64 arithmetic
10.5.3.3 Beyond binary64?
10.6 Some Implementation Tricks Used in the CRlibm Library
10.6.1 Rounding test
10.6.2 Accurate second step
10.6.3 Error analysis and the accuracy/performance tradeoff
10.6.4 The point with efficient code
10.6.4.1 Example: a double-binary64 polynomial evaluation
IV Extensions
11 Complex Numbers
11.1 Introduction
11.2 Componentwise and Normwise Errors
11.3 Computing ad - bc with an FMA
11.4 Complex Multiplication
11.4.1 Complex multiplication without an FMA instruction
11.4.2 Complex multiplication with an FMA instruction
11.5 Complex Division
11.5.1 Error bounds for complex division
11.5.2 Scaling methods for avoiding over-/underflow in complex division
11.6 Complex Absolute Value
11.6.1 Error bounds for complex absolute value
11.6.2 Scaling for the computation of complex absolute value
11.7 Complex Square Root
11.7.1 Error bounds for complex square root
11.7.2 Scaling techniques for complex square root
11.8 An Alternative Solution: Exception Handling
12 Interval Arithmetic
12.1 Introduction to Interval Arithmetic
12.1.1 Definitions and the inclusion property
12.1.2 Loss of algebraic properties
12.2 The IEEE 1788-2015 Standard for Interval Arithmetic
12.2.1 Structuration into levels
12.2.2 Flavors
12.2.2.1 Common intervals and operations
12.2.2.2 Set-based flavor
12.2.3 Decorations
12.2.4 Level 2: discretization issues
12.2.5 Exact dot product
12.2.6 Levels 3 and 4: implementation issues
12.2.7 Libraries implementing IEEE 1788-2015
12.3 Intervals with Floating-Point Bounds
12.3.1 Implementation using floating-point arithmetic
12.3.2 Difficulties
12.3.3 Optimized rounding
12.4 Interval Arithmetic and Roundoff Error Analysis
12.4.1 Influence of the computing precision
12.4.2 A more efficient approach: the mid-rad representation
12.4.2.1 Why is the mid-rad representation of intervals interesting?
12.4.2.2 Influence of roundoff errors
12.4.3 Variants: affine arithmetic, Taylor models
12.4.3.1 Affine arithmetic
12.4.3.2 Polynomial models, Taylor models
12.5 Further Readings
13 Verifying Floating-Point Algorithms
13.1 Formalizing Floating-Point Arithmetic
13.1.1 Defining floating-point numbers
13.1.1.1 Structural definition
13.1.1.2 Binary representation
13.1.1.3 Semantic interpretation
13.1.2 Simplifying the definition
13.1.3 Defining rounding operators
13.1.3.1 Range and precision
13.1.3.2 Relational and functional definitions
13.1.3.3 Monotonicity
13.1.4 Extending the set of numbers
13.2 Formalisms for Verifying Algorithms
13.2.1 Hardware units
13.2.2 Floating-point algorithms
13.2.3 Automating proofs
13.3 Roundoff Errors and the Gappa Tool
13.3.1 Computing on bounds
13.3.1.1 Exceptional behaviors
13.3.1.2 Quantifying the errors
13.3.2 Counting digits
13.3.2.1 Fixed-point arithmetic
13.3.2.2 Floating-point arithmetic
13.3.2.3 Application
13.3.3 Manipulating expressions
13.3.3.1 Errors between structurally similar expressions
13.3.3.2 Using intermediate expressions
13.3.3.3 Cases of user hints
13.3.4 Handling the relative error
13.3.4.1 Always-bounded relative errors
13.3.4.2 Propagation of relative errors
13.3.5 Example: toy implementation of sine
13.3.6 Example: integer division on Itanium
14 Extending the Precision
14.1 Double-Words, Triple-Words…
14.1.1 Double-word arithmetic
14.1.1.1 Addition of double-word numbers
14.1.1.2 Multiplication of double-word numbers
14.1.1.3 Division of a double-word number by a floating-point number
14.1.2 Static triple-word arithmetic
14.2 Floating-Point Expansions
14.2.1 Renormalization of floating-point expansions
14.2.1.1 The VecSumErrBranch algorithm
14.2.1.2 The renormalization algorithm
14.2.1.3 Renormalization of arbitrary numbers
14.2.2 Addition of floating-point expansions
14.2.2.1 Accurate addition algorithm
14.2.2.2 QD-like addition algorithm
14.2.3 Multiplication of floating-point expansions
14.2.3.1 QD-like multiplication algorithm
14.2.3.2 Accurate multiplication algorithm
14.2.4 Division of floating-point expansions
14.2.4.1 Newton–Raphson-based reciprocal algorithm
14.2.4.2 Newton–Raphson-based division algorithm
14.3 Floating-Point Numbers with Batched Additional Exponent
14.4 Large Precision Based on a High-Radix Representation
14.4.1 Specifications
14.4.2 Using arbitrary-precision integer arithmetic for arbitrary-precision floating-point arithmetic
14.4.3 A brief introduction to arbitrary-precision integer arithmetic
14.4.4 GNU MPFR
14.4.4.1 MPFR data
14.4.4.2 Rounding modes
14.4.4.3 Exceptions
14.4.4.4 The ternary value
14.4.4.5 Types and calling convention
14.4.4.6 Memory handling
14.4.4.7 Implemented functions
14.4.4.8 An example of how to use MPFR
14.4.4.9 Caveats
14.4.4.10 Beyond GNU MPFR
Appendix A. Number Theory Tools for Floating-Point Arithmetic
A.1 Continued Fractions
A.2 Euclidean Lattices
Appendix B. Previous Floating-Point Standards
B.1 The IEEE 754-1985 Standard
B.1.1 Formats specified by IEEE 754-1985
B.1.2 Rounding modes specified by IEEE 754-1985
B.1.3 Operations specified by IEEE 754-1985
B.1.3.1 Arithmetic operations and square root
B.1.3.2 Conversions to and from decimal strings
B.1.4 Exceptions specified by IEEE 754-1985
B.2 The IEEE 854-1987 Standard
B.2.1 Constraints internal to a format
B.2.2 Various formats and the constraints between them
B.2.3 Rounding
B.2.4 Operations
B.2.5 Comparisons
B.2.6 Exceptions
B.3 The Need for a Revision
B.4 The IEEE 754-2008 Revision
Bibliography
Index