Neural networks, on which modern deep learning and artificial intelligence systems are built, in most cases use the standard 32-bit IEEE FP32 floating-point format. This provides high accuracy of the calculations and the final result, but requires large amounts of memory and high-performance processors that consume a significant amount of energy. On systems with strict performance requirements and limited computing resources, signed 8-bit integers (INT8) are used instead. This yields high-performance artificial intelligence systems at the cost of accuracy in both the calculations and the final result.

To resolve this trade-off between the performance of artificial intelligence systems and the bit width of the numbers they use, Google Brain specialists developed a special floating-point format optimized for deep learning that delivers results with the least possible loss of accuracy. This format, BF16 (BFloat16, Brain Float 16), has already found wide application in specialized hardware accelerators developed by Google, Intel, ARM, and others.

What is the difference between FP32 and BF16? A floating-point number in the standard FP32 format consists of 1 sign bit that defines the sign of the number (+ or -), followed by an 8-bit exponent (the power of two by which the number is scaled), followed by a 23-bit mantissa (the significant digits of the number). Together these fields occupy the full 32 bits.
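The FP32 layout described above can be inspected directly in Python. The small helper below is an illustrative sketch (the function name is our own, not part of any standard library): it unpacks a float's raw 32-bit pattern into the three fields.

```python
import struct

def fp32_fields(x: float) -> tuple[int, int, int]:
    """Split a float into its IEEE FP32 sign, exponent, and mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF         # 23 bits (implicit leading 1 is not stored)
    return sign, exponent, mantissa

# 1.0 is stored as sign=0, exponent=127 (the bias), mantissa=0
print(fp32_fields(1.0))   # (0, 127, 0)
print(fp32_fields(-2.0))  # (1, 128, 0)
```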

For the BF16 format, Google Brain experts proposed truncating the mantissa to 7 bits. This choice was far from accidental: experiments showed that the quality of neural networks is much more sensitive to the size of the exponent than to the size of the mantissa, and BF16 turned out to be the most acceptable compromise.

Thus, a BF16 number consists of one sign bit, an 8-bit exponent, and a 7-bit mantissa, for a total of 16 bits. Tensor operations on numbers in the BF16 format require far less computational power, memory, and energy. We remind our readers that a tensor is a multidimensional array of numbers, and tensor multiplication is the key operation on which all calculations in artificial intelligence systems are based.

One may ask: why not use the standard truncated IEEE FP16 floating-point format in artificial intelligence systems? After all, this format is used quite successfully in many applications related to computer graphics and computer games. A number in the FP16 format looks like this: one sign bit, a 5-bit exponent, and a 10-bit mantissa, for a total of 16 bits.
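Python's `struct` module can encode IEEE FP16 natively (the `e` format character), so the 1-5-10 layout is easy to verify; the helper name below is our own illustration.

```python
import struct

def fp16_fields(x: float) -> tuple[int, int, int]:
    """Encode x as IEEE FP16 and split the 16-bit pattern into its fields."""
    bits = struct.unpack(">H", struct.pack(">e", x))[0]  # raw 16-bit pattern
    sign = bits >> 15                  # 1 bit
    exponent = (bits >> 10) & 0x1F     # 5 bits, biased by 15
    mantissa = bits & 0x3FF            # 10 bits
    return sign, exponent, mantissa

print(fp16_fields(1.0))   # (0, 15, 0)
# The 5-bit exponent caps the largest normal FP16 value at 65504
print(struct.unpack(">e", struct.pack(">e", 65504.0))[0])  # 65504.0
```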

First, it can immediately be noted that the truncated exponent of FP16 defines a much smaller dynamic range than that of the BF16 format, which is almost equal to the range of FP32. Secondly, converting an FP32 number to FP16 requires a considerably more complicated procedure than the simple mantissa truncation needed to convert FP32 to BF16. And thirdly, the shortened mantissa of the BF16 format allows the datapaths of the hardware multiplier blocks to be narrower than those required for FP16. As a result, the multiplier blocks for BF16 occupy about eight times less chip area than multipliers for FP32, and about two times less than multipliers for FP16.
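The simple truncation mentioned above can be sketched in a few lines of Python: converting FP32 to BF16 amounts to zeroing the low 16 bits of the 32-bit pattern. (Real hardware typically also rounds to nearest even; plain truncation is the minimal variant shown here.)

```python
import struct

def fp32_to_bf16(x: float) -> float:
    """Convert FP32 to BF16 by keeping only the top 16 bits (truncation)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bf16_bits = bits & 0xFFFF0000   # drop the low 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bf16_bits))[0]

print(fp32_to_bf16(3.141592653589793))  # 3.140625 — roughly 3 decimal digits survive
print(fp32_to_bf16(1e38))               # still finite, unlike FP16, which overflows above 65504
```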

In conclusion, it should be noted that BF16 is not the only format developed for artificial intelligence systems. In 2017, Nervana introduced a format called Flexpoint, which was supposed to combine the benefits of integers with those of floating-point numbers. In essence, this format is a modification of fixed-point numbers, which consist of two integer parts: the first is the integer part of the number (before the decimal point), the second is the fractional part (after the decimal point). Nervana specialists supplemented this fixed-point representation with an exponent; however, to speed up operations, a single exponent was shared by all the numbers making up a tensor. With this approach, tensor multiplication could be performed using very fast integer arithmetic.
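A toy sketch of the shared-exponent idea follows; the function names and the 15-bit mantissa width are our own illustrative choices, not Nervana's actual design.

```python
import math

def quantize_shared_exponent(values, mantissa_bits=15):
    """Flexpoint-style encoding: one shared exponent for the whole tensor,
    each element stored as a signed integer mantissa."""
    # Choose the shared exponent so the largest magnitude fits the integer range
    max_abs = max(abs(v) for v in values)
    exp = math.ceil(math.log2(max_abs)) - mantissa_bits
    scale = 2.0 ** exp
    ints = [round(v / scale) for v in values]   # integer mantissas
    return ints, exp

def dequantize(ints, exp):
    scale = 2.0 ** exp
    return [i * scale for i in ints]

# The exponent is dictated by the largest element, so small elements lose precision
ints, exp = quantize_shared_exponent([1000.0, 1.5, 0.01])
print(dequantize(ints, exp))  # [1000.0, 1.5, 0.0] — 0.01 is flushed to zero
```

The last line illustrates the narrow-dynamic-range problem described below: once one large value fixes the shared exponent, much smaller values in the same tensor can no longer be represented.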

But then a problem arose related to the narrow dynamic range of numbers sharing the same exponent. This problem prevented the Flexpoint format from taking off, and even the first Nervana accelerators already used the BF16 format.