Work in Progress: Lecture Notes on the Status of IEEE 754 May 31, 1996 2:44 pm
Lecture Notes on the Status of
IEEE Standard 754 for Binary Floating-Point Arithmetic
Prof. W. Kahan
Elect. Eng. & Computer Science
University of California
Berkeley CA 94720-1776
Introduction:
Twenty years ago anarchy threatened floating-point arithmetic. Over a dozen commercially significant arithmetics boasted diverse wordsizes, precisions, rounding procedures and over/underflow behaviors, and more were in the works. “Portable” software intended to reconcile that numerical diversity had become unbearably costly to develop.
Eleven years ago, when IEEE 754 became official, major microprocessor manufacturers had already adopted it despite the challenge it posed to implementors. With unprecedented altruism, hardware designers rose to its challenge in the belief that they would ease and encourage a vast burgeoning of numerical software. They did succeed to a considerable extent. Anyway, rounding anomalies that preoccupied all of us in the 1970s afflict only CRAYs now.
Now atrophy threatens features of IEEE 754 caught in a vicious circle:
Those features lack support in programming languages and compilers,
so those features are mishandled and/or practically unusable,
so those features are little known and less in demand, and so
those features lack support in programming languages and compilers.
To help break that circle, those features are discussed in these notes under the following headings:
    Representable Numbers, Normal and Subnormal, Infinite and NaN
    Encodings, Span and Precision
    Multiply-Accumulate, a Mixed Blessing
    Exceptions in General; Retrospective Diagnostics
    Exception: Invalid Operation; NaNs
    Exception: Divide by Zero; Infinities
    Digression on Division by Zero; Two Examples
    Exception: Overflow
    Exception: Underflow
    Digression on Gradual Underflow; an Example
    Exception: Inexact
    Directions of Rounding
    Precisions of Rounding
    The Baleful Influence of Benchmarks; a Proposed Benchmark
    Exceptions in General, Reconsidered; a Suggested Scheme
    Ruminations on Programming Languages
    Annotated Bibliography
Insofar as this is a status report, it is subject to change and supersedes versions with earlier dates. This version supersedes one distributed at a panel discussion of “Floating-Point Past, Present and Future” in a series of San Francisco Bay Area Computer History Perspectives sponsored by Sun Microsystems Inc. in May 1995. A Post-Script version is accessible electronically as http.cs.berkeley.edu/~wkahan/ieee754status/ieee754.ps .
Representable Numbers:
IEEE 754 specifies three types or Formats of floating-point numbers:
Single ( Fortran's REAL*4, C's float ), ( Obligatory ),
Double ( Fortran's REAL*8, C's double ), ( Ubiquitous ), and
Double-Extended ( Fortran REAL*10+, C's long double ), ( Optional ).
( A fourth Quadruple-Precision format is not specified by IEEE 754 but has become a de facto standard among several computer makers none of whom support it fully in hardware yet, so it runs slowly at best.)
Each format has representations for NaNs (Not-a-Number), ±∞ (Infinity), and its own set of finite real numbers all of the simple form
    $2^{k+1-N}\, n$
with two integers n ( signed Significand ) and k ( unbiased signed Exponent ) that run throughout two intervals determined from the format thus:
    K+1 Exponent bits: $1 - 2^K < k < 2^K$ .    N Significant bits: $-2^N < n < 2^N$ .
This concise representation $2^{k+1-N}\, n$ , unique to IEEE 754, is deceptively simple. At first sight it appears potentially ambiguous because, if n is even, dividing n by 2 ( a right-shift ) and then adding 1 to k makes no difference. Whenever such an ambiguity could arise it is resolved by minimizing the exponent k and thereby maximizing the magnitude of significand n ; this is “ Normalization ” which, if it succeeds, permits a Normal nonzero number to be expressed in the form $2^{k+1-N}\, n = \pm 2^k ( 1 + f )$ with a nonnegative fraction f < 1 .
Besides the Normal numbers, IEEE 754 has Subnormal ( Denormalized ) numbers lacking or suppressed in earlier computer arithmetics; Subnormals, which permit Underflow to be Gradual, are nonzero numbers with an unnormalized significand n and the same minimal exponent k as is used for 0 :
    Subnormal $2^{k+1-N}\, n = \pm 2^k ( 0 + f )$ has $k = 2 - 2^K$ and $0 < |n| < 2^{N-1}$ , so 0 < f < 1 .
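The representation above can be tried out on IEEE 754 Double, for which N = 53 and K = 10. The following is only a sketch: the names decompose and is_subnormal are hypothetical helpers introduced here, and the code assumes that Python's float is an IEEE 754 Double and that its argument is finite.

```python
import math

N = 53               # significant bits of IEEE 754 Double
K = 10               # Double has K+1 = 11 exponent bits
K_MIN = 2 - 2**K     # minimal exponent k, shared by Subnormals and 0

def decompose(x):
    """Return integers (n, k) with x == n * 2**(k+1-N) and k minimized."""
    if x == 0.0:
        return 0, K_MIN
    _, e = math.frexp(x)            # x == m * 2**e with 0.5 <= |m| < 1
    k = max(e - 1, K_MIN)           # normalization cannot go below K_MIN
    n = math.ldexp(x, N - 1 - k)    # exact: only a power-of-two scaling
    return int(n), k

def is_subnormal(x):
    """True when x is nonzero with minimal k and |n| < 2**(N-1)."""
    n, k = decompose(x)
    return k == K_MIN and 0 < abs(n) < 2**(N - 1)

for x in (1.0, -0.75, 2.0**-1022, 5e-324):
    n, k = decompose(x)
    print(x, n, k, is_subnormal(x))
```

A Normal number comes out with |n| at least 2**(N-1); a Subnormal comes out with a smaller |n| and k stuck at its minimum.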
Thus, where earlier arithmetics had conspicuous gaps between 0 and the tiniest Normal numbers $\pm 2^{2-2^K}$ , IEEE 754 fills the gaps with Subnormals spaced the same distance apart as the smallest Normal numbers:
[ Figure: Consecutive Positive Floating-Point Numbers on a number line. Subnormals lie between 0 and $2^{2-2^K}$ ; Normalized Numbers follow, with the Powers of 2 $2^{2-2^K}$ , $2^{3-2^K}$ and $2^{4-2^K}$ marked. ]
Table of Formats’ Parameters:
    Format             Bytes    K+1      N
    Single                 4      8     24
    Double                 8     11     53
    Double-Extended     ≥ 10   ≥ 15   ≥ 64
    ( Quadruple           16     15    113 )
IEEE 754 encodes floating-point numbers in memory (not in registers) in ways first proposed by I.B. Goldberg in Comm. ACM (1967) 105-6 ; it packs three fields with integers derived from the sign, exponent and significand of a number as follows. The leading bit is the sign bit, 0 for + and 1 for - . The next K+1 bits hold a biased exponent. The last N or N-1 bits hold the significand's magnitude. To simplify the following table, the significand n is dissociated from its sign bit so that n may be treated as nonnegative.
Encodings of $\pm 2^{k+1-N}\, n$ into Binary Fields :
    Number Type   Sign Bit   K+1 bit Exponent   Nth bit   N-1 bits of Significand
    NaNs:            ?       binary 111...111      1      binary 1xxx...xxx
    SNaNs:           ?       binary 111...111      1      nonzero binary 0xxx...xxx
    Infinities:      ±       binary 111...111      1      0
    Normals:         ±       k-1 + 2^K             1      nonnegative n - 2^(N-1) < 2^(N-1)
    Subnormals:      ±       0                     0      positive n < 2^(N-1)
    Zeros:           ±       0                     0      0
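To make the table concrete, here is a sketch that unpacks the three fields of a Double ( K+1 = 11, N = 53, implicit Nth bit ) and classifies the number accordingly. The names fields and classify are hypothetical; the code assumes Python's float is an IEEE 754 Double, and that the platform's default NaN is a quiet one with the leading fraction bit set, as on most current hardware.

```python
import struct

def fields(x):
    """Return (sign bit, biased exponent, N-1 fraction bits) of a Double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

def classify(x):
    """Name the table row to which x belongs."""
    _, exponent, fraction = fields(x)
    if exponent == 0x7FF:                  # exponent field all ones
        if fraction == 0:
            return "Infinity"
        return "NaN" if fraction >> 51 else "SNaN"
    if exponent == 0:                      # exponent field all zeros
        return "Zero" if fraction == 0 else "Subnormal"
    return "Normal"

for x in (1.0, -2.0, 5e-324, -0.0, float("inf"), float("nan")):
    print(x, fields(x), classify(x))
```

For 1.0 the exponent field holds k-1 + 2^K = 0 - 1 + 1024 = 1023, in agreement with the row for Normals.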
Note that +0 and -0 are distinguishable and follow obvious rules specified by IEEE 754 even though floating-point arithmetical comparison says they are equal; there are good reasons to do this, some of them discussed in my 1987 paper “ Branch Cuts ... .” The two zeros are distinguishable arithmetically only by either division-by-zero ( producing appropriately signed infinities ) or else by the CopySign function recommended by IEEE 754 / 854. Infinities, SNaNs, NaNs and Subnormal numbers necessitate four more special cases.
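A small illustration of the two zeros, offered as a sketch rather than as anything IEEE 754 prescribes for Python. Python raises ZeroDivisionError on floating-point division by zero, so only the CopySign route is shown; the atan2 lines are an added illustration of how the sign of zero can select a side of a branch cut.

```python
import math

pos, neg = 0.0, -0.0

# Arithmetical comparison says the two zeros are equal ...
print(pos == neg)

# ... but CopySign tells them apart.
print(math.copysign(1.0, pos), math.copysign(1.0, neg))

# The sign of zero picks the side of the cut along the negative real axis.
print(math.atan2(pos, -1.0), math.atan2(neg, -1.0))
```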
IEEE Single and Double have no Nth bit in their significant digit fields; it is “ implicit.” 680x0 / ix87 Extendeds have an explicit Nth bit for historical reasons; it allowed the Intel 8087 to suppress the normalization of subnormals advantageously for certain scalar products in matrix computations, but this and other features of the 8087 were later deemed too arcane to include in IEEE 754, and have atrophied.
Non-Extended encodings are all “ Lexicographically Ordered,” which means that if two floating-point numbers in the same format are ordered ( say x < y ), then they are ordered the same way when their bits are reinterpreted as Sign-Magnitude integers. Consequently, processors need no floating-point hardware to search, sort and window floating-point arrays quickly. ( However, some processors reverse byte-order!) Lexicographic order may also ease the implementation of a surprisingly useful function NextAfter(x, y) which delivers the neighbor of x in its floating-point format on the side towards y .
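The ordering property can be checked with integer operations alone. In this sketch sign_magnitude_key and next_up are hypothetical names; next_up covers only one direction of NextAfter, for finite nonnegative arguments, and the code assumes Python's float is an IEEE 754 Double. ( Python 3.9 and later also supply math.nextafter. )

```python
import struct

def sign_magnitude_key(x):
    """Reinterpret a Double's 64 bits as a Sign-Magnitude integer."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    magnitude = bits & 0x7FFFFFFFFFFFFFFF
    return -magnitude if bits >> 63 else magnitude

def next_up(x):
    """Neighbor of x towards +Infinity, for finite x >= 0 only."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0] & 0x7FFFFFFFFFFFFFFF
    return struct.unpack(">d", struct.pack(">Q", bits + 1))[0]

data = [3.5, -1e300, 5e-324, -2.0, 1.0, float("inf"), -5e-324, 0.0]
print(sorted(data, key=sign_magnitude_key) == sorted(data))
print(next_up(1.0) - 1.0)     # the gap just above 1.0
```

Adding 1 to the integer image steps from one floating-point number to the next, even across a power of 2 or from the largest finite number to Infinity.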
Algebraic operations covered by IEEE 754, namely + , - , · , / , √ and Binary <-> Decimal Conversion with rare exceptions, must be Correctly Rounded to the precision of the operation’s destination unless the programmer has specified a rounding other than the default. If it does not Overflow, a correctly rounded operation’s error cannot exceed half the gap between adjacent floating-point numbers astride the operation’s ideal ( unrounded ) result. Half-way cases are rounded to Nearest Even, which means that the neighbor with last digit 0 is chosen. Besides its lack of statistical bias, this choice has a subtle advantage; it prevents prolonged drift during slowly convergent iterations containing steps like these:
While ( ... ) do { y := x+z ; ... ; x := y-z } .
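The drift in question is easiest to exhibit in decimal. The sketch below is an analogy, not binary IEEE 754 arithmetic: it uses Python's decimal module with an assumed precision of 3 significant digits, and the starting values 1.00 and 0.555 are chosen for illustration so that every step lands on a half-way case. The name iterate is hypothetical.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN, ROUND_HALF_UP

def iterate(rounding, steps=10):
    """Repeat y := x+z ; x := y-z in 3-digit decimal arithmetic."""
    ctx = Context(prec=3, rounding=rounding)
    x, z = Decimal("1.00"), Decimal("0.555")
    for _ in range(steps):
        y = ctx.add(x, z)         # rounded to 3 sig. dec.
        x = ctx.subtract(y, z)    # rounded to 3 sig. dec.
    return x

print(iterate(ROUND_HALF_EVEN))   # half-way cases go to the even neighbor
print(iterate(ROUND_HALF_UP))     # half-way cases always go upward
```

Rounding half-way cases upward pushes x up by one unit in the last place at every pass; rounding them to Nearest Even leaves x where it started.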
A consequence of correct rounding ( and Gradual Underflow ) is that the calculation of an expression X•Y for any algebraic operation • produces, if finite, a result (X•Y)·( 1 + ß ) + µ where |µ| cannot exceed half the smallest gap between numbers in the destination’s format, and |ß| < $2^{-N}$ , and ß·µ = 0 . ( µ ≠ 0 only when Underflow occurs.) This characterization constitutes a weak model of roundoff used widely to predict error bounds for software. The model characterizes roundoff weakly because, for instance, it cannot confirm that, in the absence of Over/Underflow or division by zero, $-1 \le x/\sqrt{x^2 + y^2} \le 1$ despite five rounding errors, though this is true and easy to prove for IEEE 754, harder to prove for most other arithmetics, and can fail on a CRAY Y-MP.
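The model can be spot-checked in Double ( N = 53 ) by comparing each rounded result against exact rational arithmetic. This is a sketch on a few sample operands chosen here, none of which underflows ( so µ = 0 ); the name beta is hypothetical, and Python's float is assumed to be an IEEE 754 Double.

```python
from fractions import Fraction
import math
import operator

N = 53
BOUND = Fraction(1, 2**N)

def beta(op, x, y):
    """Relative error of the rounded op(x, y) against the exact result."""
    exact = op(Fraction(x), Fraction(y))
    computed = Fraction(op(x, y))
    return (computed - exact) / exact

ops = (operator.add, operator.sub, operator.mul, operator.truediv)
for op in ops:
    print(op.__name__, float(beta(op, 0.1, 0.3)), abs(beta(op, 0.1, 0.3)) < BOUND)

# The stronger fact that the weak model cannot confirm:
samples = [(3.0, 4.0), (0.1, 1e-9), (-7.7, 1e-12), (1e10, 1.0)]
print(all(-1.0 <= x / math.sqrt(x * x + y * y) <= 1.0 for x, y in samples))
```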
The following table exhibits the span of each floating-point format, and its precision both as an upper bound $2^{-N}$ upon relative error ß and in “ Significant Decimals.”
Span and Precision of IEEE 754 Floating-Point Formats :
    Format         Min. Subnormal   Min. Normal     Max. Finite    2^-N           Sig. Dec.
    Single:        1.4 E-45         1.2 E-38        3.4 E38        5.96 E-8       6 - 9
    Double:        4.9 E-324        2.2 E-308       1.8 E308       1.11 E-16      15 - 17
    Extended:      ≤ 3.6 E-4951     ≤ 3.4 E-4932    ≥ 1.2 E4932    ≤ 5.42 E-20    ≥ 18 - 21
    ( Quadruple:   6.5 E-4966       3.4 E-4932      1.2 E4932      9.63 E-35      33 - 36 )
Entries in this table come from the following formulas:
    Min. Positive Subnormal:  $2^{3 - 2^K - N}$
    Min. Positive Normal:     $2^{2 - 2^K}$
    Max. Finite:              $(1 - 1/2^N)\, 2^{2^K}$
    Sig. Dec., at least:      floor( (N-1) Log10(2) ) sig. dec.
               at most:       ceil( 1 + N Log10(2) ) sig. dec.
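These formulas are easy to evaluate exactly. The sketch below does so with rational arithmetic and compares the Double row ( K = 10, N = 53 ) against the constants Python reports for its own float, which is assumed to be an IEEE 754 Double; span_and_precision is a hypothetical name.

```python
from fractions import Fraction
import math
import sys

def span_and_precision(K, N):
    """Evaluate the table's formulas for a format with K+1 exponent bits."""
    two = Fraction(2)
    return {
        "min_subnormal": two ** (3 - 2**K - N),
        "min_normal": two ** (2 - 2**K),
        "max_finite": (1 - 1 / two**N) * two ** (2**K),
        "sig_dec": (math.floor((N - 1) * math.log10(2)),
                    math.ceil(1 + N * math.log10(2))),
    }

double = span_and_precision(10, 53)
print(float(double["min_subnormal"]), float(double["min_normal"]),
      float(double["max_finite"]), double["sig_dec"])
print(float(double["max_finite"]) == sys.float_info.max)
print(span_and_precision(7, 24)["sig_dec"])      # Single
```

The Extended and Quadruple rows follow from ( K, N ) = ( 14, 64 ) and ( 14, 113 ), though their spans exceed what a Double can hold, so those results must stay as exact fractions.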
The precision is bracketed within a range in order to characterize how accurately conversion between binary and decimal has to be implemented to conform to IEEE 754. For instance, “ 6 - 9 ” Sig. Dec. for Single means that, in the absence of OVER/UNDERFLOW, ...
If a decimal string with at most 6 sig. dec. is converted to Single and then converted back to the
same number of sig. dec., then the final string should match the original. Also, ...
If a Single Precision floating-point number is converted to a decimal string with at least 9 sig.
dec. and then converted back to Single, then the final number must match the original.
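The same two round-trip properties hold for Double with “ 15 - 17 ” in place of “ 6 - 9 ”, and that case can be tried directly in Python, whose float is assumed here to be an IEEE 754 Double with correctly rounded conversions. The helper names are hypothetical, and the string comparison works only for strings already in the form that the "g" format emits ( no trailing zeros ).

```python
def double_survives(x, digits):
    """Double -> decimal string with `digits` sig. dec. -> Double."""
    return float(format(x, ".%dg" % digits)) == x

def string_survives(s, digits):
    """Decimal string -> Double -> decimal string with `digits` sig. dec."""
    return format(float(s), ".%dg" % digits) == s

x = 0.1 + 0.2
print(format(x, ".17g"), double_survives(x, 17))    # 17 sig. dec. suffice
print(format(x, ".15g"), double_survives(x, 15))    # 15 may not
print(string_survives("0.123456789012345", 15))     # 15 sig. dec. come back
```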
Most microprocessors that support floating-point on-chip, and all that serve in prestigious workstations, support just the two REAL*4 and REAL*8 floating-point formats. In some cases the registers are all 8 bytes wide, and REAL*4 operands are converted on the fly to their REAL*8 equivalents when they are loaded into a register; in such cases, immediately rounding to REAL*4 every REAL*8 result of an operation upon such converted operands produces the same result as if the operation had been performed in the REAL*4 format all the way.
But Motorola 680x0-based Macintoshes and Intel ix86-based PCs with ix87-based ( not Weitek’s 1167 or 3167 ) floating-point behave quite differently; they perform all arithmetic operations in the Extended format, regardless of the operands’ widths in memory, and round to whatever precision is called for by the setting of a control word.
Only the Extended format appears in a 680x0’s eight floating-point flat registers or an ix87’s eight floating-point stack-registers, so all numbers loaded from memory in any other format, floating-point or integer or BCD, are converted on the fly into Extended with no change in value. All arithmetic operations enjoy the Extended range and precision. Values stored from a register into a narrower memory format get rounded on the fly, and may also incur OVER/UNDERFLOW. ( Since the register’s value remains unchanged, unless popped off the ix87’s stack, misconstrued ambiguities in manuals or ill-considered “ optimizations ” cause some compilers sometimes wrongly to reuse that register’s value in place of what was stored from it; this subtle bug will be re-examined later under “ Precisions of Rounding ”.)
Since the Extended format is optional in implementations of IEEE 754, most chips do not offer it; it is available only on Intel’s x86/x87, Pentium, P6 and their clones by AMD and Cyrix, on Intel’s 80960 KB, on Motorola’s 68040/60 or earlier 680x0 with 68881/2 coprocessor, and on Motorola’s 88110, all with 64 sig. bits and 15 bits of exponent, but in words that may be 80 or 96 or 128 bits wide when stored in memory. This format is intended mainly to help programmers enhance the integrity of their Single and Double software, and to attenuate degradation by roundoff in Double matrix computations of larger dimensions, and can easily be used in such a way that substituting Quadruple for Extended need never invalidate its use. However, language support for Extended is hard to find.
Multiply-Accumulate, a Mixed Blessing:
The IBM Power PC and Apple Power Macintosh, both derived from the IBM RS/6000 architecture, purport to conform to IEEE 754 but too often use a “ Fused ” Multiply-Add instruction in a non-conforming way. The idea behind a Multiply-Add ( or “ MAC ” for “ Multiply-Accumulate ” ) instruction is that an expression like
    ±a*b ± c
be evaluated in one instruction so implemented that scalar products like
    a1*b1 + a2*b2 + a3*b3 + ... + aL*bL
can be evaluated in about L+3 machine cycles. Many machines have a MAC. Beyond that, a Fused MAC evaluates ±a*b ± c with just one rounding error at the end. This is done not so much to roughly halve the rounding errors in a scalar product as to facilitate fast and correctly rounded division without much hardware dedicated to it. To compute q = x/y correctly rounded, it suffices to have hardware approximate the reciprocal 1/y to several sig. bits by a value t looked up in a table, and then improve t by iteration thus:
t := t + (1 - t*y)*t .
Each such iteration doubles the number of correct bits in t at the cost of two MACs until t is accurate enough to produce q := t*x . To round q correctly, its remainder r := x - q*y must be obtained exactly; this is what the “ Fused ” in the Fused MAC is for. It also speeds up correctly rounded square root, decimal <-> binary conversion, and some transcendental functions. These and other uses make a Fused MAC worth putting into a computer's instruction set. ( If only division and square root were at stake we might do better merely to widen the multiplier hardware slightly in a way accessible solely to microcode, as TI does in its SPARC chips.)
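The doubling of correct bits can be watched in a few lines. This is a sketch only: it uses ordinary Double operations instead of a Fused MAC, a crude starting guess t = 0.3 for y = 3 in place of a table lookup, and exact rational arithmetic as a stand-in for the Fused MAC when forming the remainder. The names refine_reciprocal and exact_remainder are hypothetical.

```python
from fractions import Fraction

def refine_reciprocal(y, t, steps):
    """Apply t := t + (1 - t*y)*t repeatedly; record |1 - t*y| each time."""
    residuals = []
    for _ in range(steps):
        t = t + (1.0 - t * y) * t
        residuals.append(abs(1.0 - t * y))
    return t, residuals

def exact_remainder(x, q, y):
    """r = x - q*y with no rounding, as a Fused MAC would deliver it."""
    return Fraction(x) - Fraction(q) * Fraction(y)

x, y = 1.0, 3.0
t, residuals = refine_reciprocal(y, 0.3, 5)
print(residuals[:3])          # each residual is about the square of the last
q = t * x
r = exact_remainder(x, q, y)
print(q, float(r), Fraction(float(r)) == r)
```

The last line confirms that, for this quotient, the exact remainder fits in a Double, which is what lets one more MAC decide which way q should be rounded.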
A Fused MAC also speeds up a grubby “Doubled-Double” approximation to Quadruple-Precision arithmetic by unevaluated sums of pairs of Doubles. Its advantage comes about from a Fused MAC's ability to evaluate any product a*b exactly; first let p := a*b rounded off; then compute c := a*b - p exactly in another Fused MAC, so that a*b = p + c exactly without roundoff. Fast but grubby Double-Double undermines the incentive to provide Quadruple-Precision correctly rounded in IEEE 754's style.
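That a*b = p + c holds exactly, with c itself a Double, can be confirmed as follows. The sketch assumes no Over/Underflow and, lacking a Fused MAC, obtains c from exact rational arithmetic; two_product is a hypothetical name.

```python
from fractions import Fraction

def two_product(a, b):
    """Return Doubles (p, c) with p = a*b rounded and p + c = a*b exactly."""
    p = a * b
    c_exact = Fraction(a) * Fraction(b) - Fraction(p)   # stand-in for Fused MAC
    return p, float(c_exact)

a = 1.0 + 2.0**-30
b = 1.0 + 2.0**-29
p, c = two_product(a, b)
print(p, c)
print(Fraction(p) + Fraction(c) == Fraction(a) * Fraction(b))
```

Here the exact product is 1 + 3·2^-30 + 2^-59 ; the last term is too small to survive in p and becomes c .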
Fused MACs generate anomalies when used to evaluate a*b ± c*d in two instructions instead of three. Which of a*b and c*d is evaluated and therefore rounded first? Either way, important expectations can be thwarted. For example, multiplying a complex number by its complex conjugate should produce a real number, but it might not with a Fused MAC. If $\sqrt{q^2 - p\,r}$ is real in the absence of roundoff, then the same is expected for
SQRT( q*q - p*r )
despite roundoff, but perhaps not with a Fused MAC. Therefore Fused MACs cannot be used indiscriminately; there are a few programs that contain a few assignment statements from which Fused MACs must be banned.
By design, a Fused MAC always runs faster than separate multiplication and add, so compiler writers with one eye on benchmarks based solely upon speed leave programmers no way to inhibit Fused MACs selectively within expressions, nor to ban them from a selected assignment statement.
Ideally, some locution like redundant parentheses should be understood to control the use of Fused MACs on machines that have them. For instance, in Fortran, ...
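Both anomalies can be reproduced by simulating a Fused MAC in exact rational arithmetic with one rounding at the end. This is a sketch: fused_mac is a hypothetical stand-in for the hardware instruction, the operands are chosen here so that the rounding errors are easy to follow, and Python's float is assumed to be an IEEE 754 Double.

```python
from fractions import Fraction

def fused_mac(a, b, c):
    """Simulate a*b + c with a single rounding at the end."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# q*q - p*r with q = p = r : exactly zero, so its square root is real.
q = p = r = 1.0 + 3 * 2.0**-28
separate = q * q - p * r                # both products rounded alike
fused = fused_mac(q, q, -(p * r))       # only p*r is rounded first
print(separate, fused)                  # the fused result is negative

# Imaginary part of (a + ib)*(a - ib), namely b*a - a*b .
a, b = 1.0 + 2.0**-30, 1.0 + 2.0**-29
print(b * a - a * b, fused_mac(b, a, -(a * b)))
```

In the first case SQRT would be handed a negative argument; in the second the product of a complex number with its conjugate acquires a tiny nonzero imaginary part.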
(A*B) + C*D and C*D + (A*B) should always round A*B first;
(A*B) + (C*D) should inhibit the use of a Fused MAC here.
Something else is needed for C , whose Macro Preprocessor often insinuates hordes of redundant parentheses. Whatever expedient is chosen must have no effect upon compilations to machines that lack a Fused MAC; a separate compiler directive at the beginning of a program should say whether the program is intended solely for machines with, or solely for machines without a Fused MAC.