Floating point error analysis

ACM '59. Pub Date: 1959-09-01. DOI: 10.1145/612201.612262

R. Nickerson



Abstract

In many floating point calculations it is important to arrange the sequence of calculations such that significant digits are not deleted by intermediate rounding of the numbers. The danger that gross errors may be introduced is always present in floating point calculations, since the number of digits carried in each number is restricted in normal operation by the design of the computer. Although the phenomenon considered here is well known to people working with numbers, the purpose of this discussion is to provide a direct approach to the examination of calculations in which gross error may be introduced. A pitfall to be avoided is that expanded versions of an expression frequently appear desirable on the surface, because terms in the expansion cancel.

The problem is illustrated simply by the product of two differences of almost equal error-free integers. All numbers will be restricted to five digits, and numbers having surplus digits will be rounded to five digits.

The direct approach yields

$$x = (65432 - 65321)(54321 - 54304) = (111)(17) = 1887. \tag{1}$$

The expanded product approach yields

$$x = (65432)(54321) - (65432)(54304) - (65321)(54321) + (65321)(54304) \tag{2}$$

$$x = (3.5543 - 3.5532 - 3.5483 + 3.5472) \times 10^{9} = 0.0 \tag{3}$$

This result is worthless. The restriction of each number to five digits has prevented us from obtaining the correct result by the expanded form. With the aid of a simple notation, we shall show further on that the same situation prevails if the input numbers are not necessarily accurate to $N$ digits.

Normalized floating point numbers have their left-most non-zero digit residing at the immediate right of the decimal point. If leading zeros exist between the decimal point and the non-zero digits of the number, the number is termed unnormalized. Floating point operation is characterized by the automatic re-scaling of unnormalized numbers to put them in normalized form. Physically, the process is accomplished by left shifting all of the digits until the leading zeros have been completely removed into the scale factor.

More than one leading zero can only be introduced by a summation process. If the sum of a set of numbers has $L$ leading zeros, each term of the sum has up to $L$ implicit leading zeros, and the largest has exactly $L$ implicit leading zeros. The multiplication of two numbers, having $L_1$ and $L_2$ implicit leading zeros respectively, gives a product containing $(L_1 + L_2)$ implicit leading zeros. The accuracy of calculations leading to a sum of numbers is determined from the number of implicit leading zeros, $L$, and the total number of digits, $N$, carried in the calculation. If $L > N$, as in the above example, the result is meaningless. Calculation procedures which involve summation should thus be examined carefully to establish that an adequate number of digits is retained in the intermediate numbers to obtain the available accuracy in the result. Procedures which produce the normalized result directly are to be preferred, since the intermediate numbers have explicit leading zeros which are automatically removed into the scale factor before intermediate roundoffs are performed.
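
The five-digit example in the abstract can be reproduced with a short sketch. This is only an illustration under stated assumptions, not the author's procedure: it uses Python's `decimal` module with a five-digit context, assumes round-half-up (the abstract only says "rounded to five digits"), and accumulates the expanded form from left to right. The function names, and the reading of $L$ as the exponent of the largest term minus the exponent of the true sum, are hypothetical choices made for this sketch.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

N = 5  # digits carried in every number, as in the abstract's example
# Assumption: round-half-up; the abstract does not name a rounding rule.
ctx = Context(prec=N, rounding=ROUND_HALF_UP)


def direct(a, b, c, d):
    """(a - b) * (c - d), with every intermediate rounded to N digits."""
    return ctx.multiply(ctx.subtract(a, b), ctx.subtract(c, d))


def expanded_terms(a, b, c, d):
    """The four signed products of the expanded form, each rounded to N digits."""
    return [
        ctx.multiply(a, c),
        ctx.minus(ctx.multiply(a, d)),
        ctx.minus(ctx.multiply(b, c)),
        ctx.multiply(b, d),
    ]


def expanded(a, b, c, d):
    """a*c - a*d - b*c + b*d, accumulated left to right in N digits."""
    total = Decimal(0)
    for term in expanded_terms(a, b, c, d):
        total = ctx.add(total, term)
    return total


def implicit_leading_zeros(terms, exact_sum):
    """Hypothetical reading of L: decimal exponent of the largest term
    minus the decimal exponent of the true sum."""
    largest = max(terms, key=abs)
    return largest.adjusted() - exact_sum.adjusted()


a, b, c, d = (Decimal(n) for n in (65432, 65321, 54321, 54304))
print(direct(a, b, c, d))         # 1887
print(expanded(a, b, c, d) == 0)  # True: every significant digit cancelled
# The true sum is computed in the default 28-digit context, so it is exact here.
L = implicit_leading_zeros(expanded_terms(a, b, c, d), (a - b) * (c - d))
print(L, L > N)                   # 6 True
```

Under these assumptions the direct form subtracts first, so the small differences are normalized before any rounding occurs and the product is exact. In the expanded form each product is rounded at the 10^5 place, while the true answer lies entirely below that place, so nothing of it survives the summation.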


