Floating point error analysis
R. Nickerson
ACM '59, published 1959-09-01
DOI: 10.1145/612201.612262 (https://doi.org/10.1145/612201.612262)
Citations: 0
Abstract
In many floating point calculations it is important to arrange the sequence of calculations so that significant digits are not lost to intermediate rounding. The danger of introducing gross errors is always present in floating point calculations, since the number of digits carried in each number is restricted in normal operation by the design of the computer. Although the phenomenon considered here is well known to people working with numbers, the purpose of this discussion is to provide a direct approach to examining calculations in which gross error may be introduced. A pitfall to be avoided is that expanded versions of an expression frequently appear desirable on the surface because terms in the expansion cancel.

The problem is illustrated simply by the product of two differences of almost equal error-free integers. All numbers will be restricted to five digits, and numbers having surplus digits will be rounded to five digits.

The direct approach yields

$$x = (65432 - 65321)(54321 - 54304) = (111)(17) = 1887. \tag{1}$$

The expanded product approach yields

$$x = (65432)(54321) - (65432)(54304) - (65321)(54321) + (65321)(54304) \tag{2}$$

$$x = (3.5543 - 3.5532 - 3.5483 + 3.5472) \times 10^{9} = 0.0 \tag{3}$$

This result is worthless. The restriction of each number to five digits has prevented us from obtaining the correct result by the expanded form. With the aid of a simple notation, we shall show further on that the same situation prevails if the input numbers are not necessarily accurate to N digits.

Normalized floating point numbers have their left-most non-zero digit residing at the immediate right of the decimal point. If leading zeros exist between the decimal point and the non-zero digits of the number, the number is termed unnormalized. Floating point operation is characterized by the automatic re-scaling of unnormalized numbers to put them in normalized form. Physically, the process is accomplished by left shifting all of the digits until the leading zeros have been completely removed into the scale factor.

More than one leading zero can only be introduced by a summation process. If the sum of a set of numbers has L leading zeros, each term of the sum has up to L implicit leading zeros, and the largest has exactly L implicit leading zeros. The multiplication of two numbers, having L1 and L2 implicit leading zeros each, gives a product containing (L1+L2) implicit leading zeros. The accuracy of calculations leading to a sum of numbers is determined from the number of implicit leading zeros, L, and the total number of digits, N, carried in the calculation. If L > N, as in the above example, the result is meaningless. Calculation procedures which involve summation should thus be examined carefully to establish that an adequate number of digits is retained in the intermediate numbers to obtain the available accuracy in the result. Procedures which produce the normalized result directly are to be preferred, since the intermediate numbers have explicit leading zeros which are automatically removed into the scale factor before intermediate roundoffs are performed.
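
The five-digit example can be reproduced with a short sketch. This is not code from the paper: it assumes Python's `decimal` module with a five-digit context and round-half-up rounding (the abstract does not name a rounding rule), and the helper names `direct`, `expanded`, and `implicit_leading_zeros` are hypothetical. The last helper encodes one possible reading of L, namely the digit count of the largest term minus the digit count of the exact sum. In exact arithmetic the two differences are 111 and 17, so the true product is 1887.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

N = 5  # digits carried in every number, as in the abstract
# Assumption: round-half-up; the abstract only says "rounded to five digits".
ctx = Context(prec=N, rounding=ROUND_HALF_UP)

A, B, C, D = 65432, 65321, 54321, 54304


def direct(a, b, c, d):
    """(a - b)(c - d): subtract first, so only small numbers are multiplied."""
    a, b, c, d = map(Decimal, (a, b, c, d))
    return ctx.multiply(ctx.subtract(a, b), ctx.subtract(c, d))


def expanded(a, b, c, d):
    """ac - ad - bc + bd, with every product and sum rounded to N digits."""
    a, b, c, d = map(Decimal, (a, b, c, d))
    ac, ad = ctx.multiply(a, c), ctx.multiply(a, d)
    bc, bd = ctx.multiply(b, c), ctx.multiply(b, d)
    return ctx.add(ctx.subtract(ctx.subtract(ac, ad), bc), bd)


def implicit_leading_zeros(terms):
    """Assumed reading of L: digits of the largest term minus digits of the
    exact (non-zero) integer sum."""
    total = sum(terms)
    largest = max(abs(t) for t in terms)
    return len(str(largest)) - len(str(abs(total)))


x_direct = direct(A, B, C, D)      # exact: every intermediate fits in N digits
x_expanded = expanded(A, B, C, D)  # compares equal to zero: all digits are lost
L = implicit_leading_zeros([A * C, -A * D, -B * C, B * D])
print(x_direct, x_expanded, L, L > N)
```

Under this reading the largest product has ten digits and the exact sum has four, so L = 6 exceeds N = 5, which agrees with the abstract's statement that the expanded result is meaningless.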