Work in Progress:                      Lecture Notes on the Status of  IEEE 754                    May 31, 1996 2:44 pm
Lecture Notes on the Status of
IEEE Standard 754  for  Binary Floating-Point Arithmetic
Prof. W. Kahan
Elect. Eng. & Computer Science
University of California
Berkeley  CA  94720-1776
Introduction:
Twenty years ago anarchy threatened floating-point arithmetic.  Over a dozen commercially significant arithmetics boasted diverse wordsizes,  precisions,  rounding procedures and over/underflow behaviors,  and more were in the works.  “Portable”  software intended to reconcile that numerical diversity had become unbearably costly to develop.
Eleven years ago,  when  IEEE 754  became official,  major microprocessor manufacturers had already adopted it despite the challenge it posed to implementors.  With unprecedented altruism,  hardware designers rose to its challenge in the belief that they would ease and encourage a vast burgeoning of numerical software.  They did succeed to a considerable extent.  Anyway,  rounding anomalies that preoccupied all of us in the 1970s  afflict only  CRAYs  now.
Now atrophy threatens features of  IEEE 754  caught in a vicious circle:
Those features lack support in programming languages and compilers,
so those features are mishandled and/or practically unusable,
so those features are little known and less in demand,  and so
those features lack support in programming languages and compilers.
To help break that circle,  those features are discussed in these notes under the following headings:

Representable Numbers;  Normal and Subnormal;  Infinite and NaN Encodings   2
Span and Precision   3-4
Multiply-Accumulate,  a Mixed Blessing   5
Exceptions in General;  Retrospective Diagnostics   6
Exception:  Invalid Operation;  NaNs   7
Exception:  Divide by Zero;  Infinities   10
Digression on Division by Zero;  Two Examples   10
Exception:  Overflow   14
Exception:  Underflow   15
Digression on Gradual Underflow;  an Example   16
Exception:  Inexact   18
Directions of Rounding   18
Precisions of Rounding   19
The Baleful Influence of Benchmarks;  a Proposed Benchmark   20
Exceptions in General,  Reconsidered;  a Suggested Scheme   23
Ruminations on Programming Languages   29
Annotated Bibliography   30
Insofar as this is a status report,  it is subject to change and supersedes versions with earlier dates.  This version supersedes one distributed at a panel discussion of  “Floating-Point Past, Present and Future”  in a series of  San Francisco Bay Area Computer History Perspectives  sponsored by  Sun Microsystems Inc.  in  May 1995.  A PostScript  version is accessible electronically as  http.cs.berkeley.edu/~wkahan/ieee754status/ieee754.ps .
Representable Numbers:
IEEE 754  specifies three types or  Formats    of floating-point numbers:
Single  ( Fortran's  REAL*4,  C's  float  ),          ( Obligatory ),
Double  ( Fortran's  REAL*8,  C's  double  ),          ( Ubiquitous ),  and
Double-Extended  ( Fortran REAL*10+,  C's  long double  ),          ( Optional ).
( A fourth  Quadruple-Precision  format is not specified by  IEEE 754  but has become a  de facto  standard among several computer makers none of whom support it fully in hardware yet,  so it runs slowly at best.)
Each format has representations for  NaNs (Not-a-Number),  ±∞  (Infinity),  and its own set of finite real numbers all of the simple form
2^(k+1-N) · n

with two integers  n  ( signed  Significand )  and  k  ( unbiased signed  Exponent )  that run throughout two intervals determined from the format thus:

K+1  Exponent bits:   1 - 2^K  <  k  <  2^K .
N  Significant bits:   -2^N  <  n  <  2^N .
This concise representation  2^(k+1-N) · n ,  unique to  IEEE 754,  is deceptively simple.  At first sight it appears potentially ambiguous because,  if  n  is even,  dividing  n  by  2  ( a right-shift )  and then adding  1  to  k  makes no difference.  Whenever such an ambiguity could arise it is resolved by minimizing the exponent  k  and thereby maximizing the magnitude of significand  n ;  this is  “ Normalization ”  which,  if it succeeds,  permits a  Normal  nonzero number to be expressed in the form

2^(k+1-N) · n  =  ± 2^k · ( 1 + f )   with a nonnegative  fraction  f < 1 .

Besides the  Normal  numbers,  IEEE 754  has  Subnormal  ( Denormalized )  numbers lacking or suppressed in earlier computer arithmetics;  Subnormals,  which permit  Underflow  to be  Gradual,  are nonzero numbers with an unnormalized significand  n  and the same minimal exponent  k  as is used for  0 :
Subnormal    2^(k+1-N) · n  =  ± 2^k · ( 0 + f )    has    k  =  2 - 2^K    and    0  <  | n |  <  2^(N-1) ,  so  0 < f < 1 .
Thus,  where earlier arithmetics had conspicuous gaps between  0  and the tiniest  Normal  numbers  ± 2^(2 - 2^K) ,  IEEE 754  fills the gaps with  Subnormals  spaced the same distance apart as the smallest  Normal  numbers:

[ Figure:  Consecutive Positive Floating-Point Numbers.  A number line runs from  0  past the powers of  2 :  2^(2 - 2^K) ,  2^(3 - 2^K) ,  2^(4 - 2^K) , ... ;  the  Subnormals  fill the gap between  0  and the smallest  Normal  number with the same spacing as the  Normals  just above it,  and the spacing doubles after each power of  2 . ]

Table of  Formats’  Parameters:
Format             Bytes    K+1      N
Single               4        8      24
Double               8       11      53
Double-Extended    ≥ 10     ≥ 15    ≥ 64
( Quadruple         16       15     113 )
IEEE 754  encodes floating-point numbers in memory  (not in registers)  in ways first proposed by  I.B. Goldberg in  Comm. ACM (1967) 105-6 ;  it packs three fields with integers derived from the sign,  exponent and significand of a number as follows.  The leading bit is the sign bit,  0  for  +  and  1  for  - .  The next  K+1  bits hold a biased exponent.  The last  N  or  N-1  bits hold the significand's magnitude.  To simplify the following table,  the significand  n  is dissociated from its sign bit so that  n  may be treated as nonnegative.
Encodings  of    ± 2^(k+1-N) · n    into  Binary Fields :

Number Type     Sign Bit    K+1 bit  Exponent     Nth bit    N-1  bits of  Significand
NaNs:               ?       binary  111...111        1       binary  1xxx...xxx
SNaNs:              ?       binary  111...111        1       nonzero binary  0xxx...xxx
Infinities:         ±       binary  111...111        1       0
Normals:            ±       k - 1 + 2^K              1       nonnegative  n - 2^(N-1)  <  2^(N-1)
Subnormals:         ±       0                        0       positive  n  <  2^(N-1)
Zeros:              ±       0                        0       0
Note that  +0  and  -0  are distinguishable and follow obvious rules specified by  IEEE 754  even though floating-point arithmetical comparison says they are equal;  there are good reasons to do this,  some of them discussed in my  1987  paper  “ Branch Cuts ... .”  The two zeros are distinguishable arithmetically only by either  division-by-zero  ( producing appropriately signed infinities )  or else by the  CopySign  function recommended by  IEEE 754 / 854.  Infinities,  SNaNs,  NaNs  and  Subnormal  numbers necessitate four more special cases.
IEEE  Single  and  Double  have no  Nth  bit in their  significand  fields;  it is  “ implicit.”  680x0 / ix87  Extendeds  have an explicit  Nth  bit for historical reasons;  it allowed the  Intel 8087  to suppress the normalization of subnormals advantageously for certain scalar products in matrix computations,  but this and other features of the  8087  were later deemed too arcane to include in  IEEE 754,  and have atrophied.
Non-Extended  encodings are all  “ Lexicographically Ordered,”  which means that if two floating-point numbers in the same format are ordered  ( say  x < y ),  then they are ordered the same way when their bits are reinterpreted as  Sign-Magnitude  integers.  Consequently,  processors need no floating-point hardware to search,  sort and window floating-point arrays quickly.  ( However,  some processors reverse byte-order!)  Lexicographic order may also ease the implementation of a surprisingly useful function  NextAfter(x, y)  which delivers the neighbor of  x  in its floating-point format on the side towards  y .
Algebraic operations covered by  IEEE 754,  namely  + , - ,  · ,  / ,  √  and  Binary <-> Decimal Conversion  with rare exceptions,  must be  Correctly Rounded  to the precision of the operation’s destination unless the programmer has specified a rounding other than the default.  If it does not  Overflow,  a correctly rounded operation’s error cannot exceed half the gap between adjacent floating-point numbers astride the operation’s ideal  ( unrounded )  result.  Half-way cases are rounded  to Nearest Even,  which means that the neighbor with last digit  0  is chosen.  Besides its lack of statistical bias,  this choice has a subtle advantage;  it prevents prolonged drift during slowly convergent iterations containing steps like these:
While ( ... )  do { y := x+z ;  ... ;  x := y-z } .
A consequence of correct rounding  ( and  Gradual Underflow )  is that the calculation of an expression  X•Y  for any algebraic operation  •  produces,  if finite,  a result  (X•Y)·( 1 + ß ) + µ  where  |µ|  cannot exceed half the smallest gap between numbers in the destination’s format,  and  |ß| < 2^-N ,  and  ß·µ = 0 .  ( µ ≠ 0  only when  Underflow  occurs.)  This characterization constitutes a weak model of roundoff used widely to predict error bounds for software.  The model characterizes roundoff  weakly  because,  for instance,  it cannot confirm that,  in the absence of  Over/Underflow or division by zero,  -1 ≤  x/√(x² + y²)  ≤ 1  despite five rounding errors,  though this is true and easy to prove for  IEEE 754,  harder to prove for most other arithmetics,  and can fail on a  CRAY Y-MP.
The following table exhibits the span of each floating-point format,  and its precision both as an upper bound  2^-N  upon relative error  ß  and in  “ Significant Decimals.”
Span  and  Precision  of  IEEE 754  Floating-Point Formats :

Format         Min. Subnormal    Min. Normal     Max. Finite     2^-N           Sig. Dec.
Single:          1.4 E-45         1.2 E-38        3.4 E38        5.96 E-8         6 - 9
Double:          4.9 E-324        2.2 E-308       1.8 E308       1.11 E-16       15 - 17
Extended:      ≤ 3.6 E-4951     ≤ 3.4 E-4932    ≥ 1.2 E4932    ≤ 5.42 E-20    ≥ 18 - 21
( Quadruple:     6.5 E-4966       3.4 E-4932      1.2 E4932      9.63 E-35       33 - 36 )
Entries in this table come from the following formulas:

Min. Positive Subnormal:    2^(3 - 2^K - N)
Min. Positive Normal:       2^(2 - 2^K)
Max. Finite:                (1 - 1/2^N) · 2^(2^K)
Sig. Dec.,  at least:       floor( (N-1)·Log10(2) )  sig. dec.
            at most:        ceil( 1 + N·Log10(2) )  sig. dec.
The precision is bracketed within a range in order to characterize how accurately conversion between binary and decimal has to be implemented to conform to  IEEE 754.  For instance,  “ 6 - 9 ”  Sig. Dec.  for  Single  means that,  in the absence of  OVER/UNDERFLOW,  ...

If a decimal string with at most  6 sig. dec.  is converted to  Single  and then converted back to the same number of  sig. dec.,  then the final string should match the original.  Also, ...

If a  Single Precision  floating-point number is converted to a decimal string with at least  9 sig. dec.  and then converted back to  Single,  then the final number must match the original.
Most microprocessors that support floating-point on-chip,  and all that serve in prestigious workstations,  support just the two  REAL*4  and  REAL*8  floating-point formats.  In some cases the registers are all  8  bytes wide,  and  REAL*4  operands are converted on the fly to their  REAL*8  equivalents when they are loaded into a register;  in such cases,  immediately rounding to  REAL*4  every  REAL*8  result of an operation upon such converted operands produces the same result as if the operation had been performed in the  REAL*4  format all the way.
But  Motorola 680x0-based Macintoshes  and  Intel ix86-based PCs  with  ix87-based  ( not  Weitek’s 1167 or 3167 )  floating-point behave quite differently;  they perform all arithmetic operations in the  Extended  format,  regardless of the operands’ widths in memory,  and round to whatever precision is called for by the setting of a control word.
Only the  Extended  format appears in a  680x0’s  eight floating-point flat registers or an  ix87’s  eight floating-point stack-registers,  so all numbers loaded from memory in any other format,  floating-point or integer or  BCD,  are converted on the fly into  Extended  with no change in value.  All arithmetic operations enjoy the  Extended  range and precision.  Values stored from a register into a narrower memory format get rounded on the fly,  and may also incur  OVER/UNDERFLOW.  ( Since the register’s value remains unchanged,  unless popped off the  ix87’s stack,  misconstrued ambiguities in manuals or ill-considered  “ optimizations ”  cause some compilers sometimes wrongly to reuse that register’s value in place of what was stored from it;  this subtle bug will be re-examined later under  " Precisions of Rounding "  below.)
Since the  Extended  format is optional in implementations of  IEEE 754,  most chips do not offer it;  it is available only on  Intel’s  x86/x87,  Pentium,  P6  and their clones by  AMD  and  Cyrix,  on  Intel’s 80960 KB,  on  Motorola’s 68040/60  or earlier  680x0  with  68881/2  coprocessor,  and on  Motorola’s  88110,  all with  64  sig. bits and  15  bits of exponent,  but in words that may be  80  or  96  or  128  bits wide when stored in memory.  This format is intended mainly to help programmers enhance the integrity of their  Single  and  Double  software,  and to attenuate degradation by roundoff in  Double  matrix computations of larger dimensions,  and can easily be used in such a way that substituting  Quadruple  for  Extended  need never invalidate its use.  However,  language support for  Extended  is hard to find.
Multiply-Accumulate,  a  Mixed Blessing:
The  IBM Power PC  and  Apple Power Macintosh,  both derived from the  IBM  RS/6000  architecture,  purport to conform to  IEEE 754  but too often use a  “ Fused ”  Multiply-Add  instruction in a non-conforming way.  The idea behind a  Multiply-Add  ( or  “ MAC ”  for  “ Multiply-Accumulate ” )  instruction is that an expression like
±a*b ± c  be evaluated in one instruction so implemented that scalar products like
a1*b1 + a2*b2 + a3*b3 + ... + aL*bL
can be evaluated in about  L+3  machine cycles.  Many machines have a  MAC.  Beyond that,  a  Fused  MAC  evaluates  ±a*b ± c  with just one rounding error at the end.  This is done not so much to roughly halve the rounding errors in a scalar product as to facilitate fast and correctly rounded division without much hardware dedicated to it.  To compute  q = x/y  correctly rounded,  it suffices to have hardware approximate the reciprocal  1/y  to several sig. bits by a value  t  looked up in a table,  and then improve  t  by iteration thus:
t  :=  t  +  (1  -  t*y)*t  .
Each such iteration doubles the number of correct bits in  t  at the cost of two  MACs  until  t  is accurate enough to produce  q := t*x .  To round  q  correctly,  its remainder  r := x - q*y  must be obtained exactly;  this is what the  “ Fused ”  in the  Fused MAC  is for.  It also speeds up correctly rounded square root,  decimal <-> binary  conversion,  and some transcendental functions.  These and other uses make a  Fused MAC  worth putting into a computer's instruction set.  ( If only division and square root were at stake we might do better merely to widen the multiplier hardware slightly in a way accessible solely to microcode,  as  TI  does in its  SPARC  chips.)
A  Fused MAC  also speeds up a grubby  “Doubled-Double”  approximation to  Quadruple-Precision  arithmetic by unevaluated sums of pairs of  Doubles.  Its advantage comes about from a  Fused MAC's  ability to evaluate any product  a*b  exactly;  first let  p := a*b  rounded off;  then compute  c := a*b - p  exactly in another  Fused MAC,  so that  a*b = p + c  exactly without roundoff.  Fast but grubby  Doubled-Double  undermines the incentive to provide  Quadruple-Precision  correctly rounded in  IEEE 754's  style.
Fused MACs  generate anomalies when used to evaluate  a*b ± c*d  in two instructions instead of three.  Which of  a*b  and  c*d  is evaluated and therefore rounded first?  Either way,  important expectations can be thwarted.  For example,  multiplying a complex number by its  complex conjugate  should produce a real number,  but it might not with a  Fused MAC.  If  √( q² - p*r )  is real in the absence of roundoff,  then the same is expected for
SQRT( q*q - p*r )
despite roundoff,  but perhaps not with a  Fused MAC.  Therefore  Fused MACs  cannot be used indiscriminately;  there are a few programs that contain a few assignment statements from which  Fused MACs  must be banned.
By design,  a  Fused MAC  always runs faster than separate multiplication and add,  so compiler writers with one eye on benchmarks based solely upon speed leave programmers no way to inhibit  Fused MACs  selectively within expressions,  nor to ban them from a selected assignment statement.
Ideally,  some locution like redundant parentheses should be understood to control the use of  Fused MACs  on machines that have them.  For instance,  in  Fortran,  ...
(A*B) + C*D    and    C*D + (A*B)    should always round  A*B  first;
(A*B) + (C*D)    should inhibit the use of a  Fused MAC  here.
Something else is needed for  C ,  whose  Macro Preprocessor  often insinuates hordes of redundant parentheses.  Whatever expedient is chosen must have no effect upon compilations to machines that lack a  Fused MAC;  a separate compiler directive at the beginning of a program should say whether the program is intended solely for machines with,  or solely for machines without a  Fused MAC.