05. IEEE 754 Floating Point Standard
The powerpoint is quite disorganized, but here’s the condensed version:
Representing floating point numbers is a bit tricky in binary. We store them in something like scientific notation, where we have a sign, ‘base’, and and exponent. In regular numbers, that looks like this: \(-6,267,000,000=-6.267\times10^9\). We have the same thing in binary. To store a number in 32 bits, we use the following layout:
(The mantissa means the ‘base’).
Converting Decimal → Binary¶
Below follows a detailed example of how to do this, but here’s the TL;DR:
To convert decimal numbers to binary IEEE FP Representation, convert the whole number portion and the fractional portion to binary, and mash them together. Then shift the ‘decimal point’ so it’s immediately to the right of the first
1
, and drop the leading one. That’s your mantissa. The amount of times you shifted left (negative, if you shifted right), plus 127, is your exponent, and the sign bit is 0 if positive, 1 if negative. For zero, set the exponent to 127, for infinity, set the exponent to 255.
Whole Numbers¶
- Divide the decimal number by 2 and take note of the result and the remainder
- Repeat step one with the result of the previous division, keeping track of the remainders, until the quotient becomes zero
- Write down the remainders in reverse order. That number is the binary representation of the decimal number. Here’s an example:
Converting \(26_{\text{Dec}}\to Binary\)
- Divide \(26\) by \(2\):
- Quotient = \(13\)
- Remainder = \(0\)
- Divide \(13\) by \(2\):
- Quotient = \(6\)
- Remainder = \(1\)
- Divide \(6\) by \(2\):
- Quotient = \(3\)
- Remainder = \(0\)
- Divide \(3\) by \(2\):
- Quotient = \(1\)
- Remainder = \(1\)
- Divide \(1\) by \(2\):
- Quotient = \(0\)
- Remainder = \(1\)
Read the remainders from bottom to top.
The binary representation of \(27\) is \(11010\).
Just for fun, here’s some Python that does the same thing, sort of:
Non-Integers¶
Converting fractional numbers from decimal to binary involves a similar process.
- Start with the fractional part of the decimal number you want to convert.
- Multiply the fractional part by 2.
- Take the integer part of the result as the first binary digit.
- Repeat steps 2 and 3 with the decimal portion of the result obtained in step 3.
- Continue this process until the fractional part becomes 0 or until you achieve the desired precision.
- The result is all of the ‘integers’ you got, in top down order. Here’s an example:
Converting \(0.625_{\text{Dec}}\to Binary\)
- Multiply \(0.625\) by \(2\):
- Result = \(1.25\)
- Take the integer part (\(1\)) as the first binary digit.
- Multiply the decimal portion (\(0.25\)) by \(2\):
- Result = \(0.5\)
- Take the integer part (0) as the second binary digit.
- Multiply the decimal portion (\(0.5\)) by \(2\):
- Result = \(1.0\)
- Take the integer part (\(1\)) as the third binary digit.
Since the decimal portion becomes \(0\) after step \(7\), we stop the process.
The binary representation of \(0.625\) is \(0.101\).
Note that the conversion may involve an infinite binary fraction for some decimal fractions, such as \(\frac13\) (\(0.\bar3\) in decimal), as they cannot be precisely represented in binary using a finite number of bits. In practice, the conversion is often rounded or limited to a certain number of binary digits based on the desired precision.
Converting Decimal → Binary Scientific Notation¶
Converting \(45.45_{\text{Dec}}\to Binary\)
Convert left hand side of number to binary:
\(45 = 22 \times 2 + 1\)
\(22 = 11 \times 2 + 0\)
\(11 = 5 \times 2 + 1\)
\(5 = 2 \times 2 + 1\)
\(2 = 1 \times 2 + 0\)
\(1 = 0 \times 2 + 1\)
Reading from bottom to top, we get \(101101\), which is \(45\) in binary.
However, when we try to follow the steps for converting the fractional part, we run into an infinite loop: the last four digits repeat forever. Therefore, we represent the result as \(01\overline{1100}\).
Great, we have our whole number portion: \(101101\), and our fractional: \(01\overline{1100}\).
The next step is to move the decimal place just after the first 1, noting how many times we move the decimal point to the left.
In our example, we need to move it +5 times to the left, (if you would’ve needed to move it to the right 5 times, it counts as -5) we get an exponent of 5 — that is, \(2^5\), and a mantissa of \(1.0110101\overline{1100}\).
Since the number is positive, our sign bit is 0. We drop the leading one of the mantissa, and let it repeat until it fills up all 23 bits of the mantissa, and that leaves \(01101011100110011001100\) as the final mantissa.
In order to store super tiny numbers, we allow for negative exponents we do this stupid thing where we ‘bias’ the exponent with -127. That means, whatever binary number is in the exponent portion will be stored as 127 more than the value it represents. In our case, we want the exponent to be 5, so to ‘counter the bias’, we add 127 to our exponent, giving us 132, which is 10000100.
Putting it all together gives us: 01000010001101011100110011001100. Putting it in the diagram:
Representing Zero and Infinity
To represent zero, use exponent 00000000
.
To represent infinity, use exponent 11111111
.
In both cases, the mantissa should be set to all zeros
Converting Binary → Decimal¶
First, look at the sign bit: if it’s 0
, the number is positive, if it’s 1
, it’s negative.
Next, take the exponent. That will be the following eight bits. Convert it to an unsigned (as in, positive) decimal number, and subtract 127 from it to un-bias the value.
Then, convert the mantissa to decimal using the following formula: \(\(\{B_1B_2B_3B_4\dots B_{23}\}_2\to \text{Base 10}=\sum_{n=1}^{23}{(B_n\cdot2^{-n})}\)\)
What this means is you multiply each bit value (either 0
or 1
) by (2 to the power of it’s position relative to the ‘decimal point’), and add them all up.
The final result is \(-1^\text{sign bit}\cdot(1+\text{mantissa})\cdot2^{\text{unbiased exponent}}\)
Let’s look at the following example:
A register has the following data: 1000 1000 1000 1000 1000 000 0000 0000
.
Looking at the sign bit, we can see it’ll be negative, since it’s 1
.
The exponent is 00010001
, which is 17 in decimal. Subtracting 127 to un-bias it gives us -110 as our exponent.
The mantissa is equal to \(2^{-4}+2^{-8}\) since only bits \(4\) and \(8\) are 1
. This is equal to \(0.06640625_{10}\).
Using our formula from earlier, plugging in the mantissa, un-biased exponent, and the sign bit:
\(\(-1^1\cdot(1+0.06640625)\cdot2^{-110}=-1.06640625\times2^{-110}=-8.2152949\times10^{-34}\)\)
… and we’re done! Nice.
Floating Point Arithmetic¶
Addition and Subtraction¶
When performing arithmetic operations on floating point numbers, the exponents need to be equal first. In base 10, if you wanted to add \(500+50,000\) in floating point representation, you would write them with an equal exponent: \(0.5\cdot10^4+5\cdot10^4=5.05\cdot10^4\)
We prefer the highest power of the summands, so if one exponent is smaller than the other, you would raise the exponent of the smaller number until it’s equal to the larger one. E.g., \(11.1\cdot2^4+0.11\cdot2^5=1.11\cdot2^5+0.11\cdot2^5=10.10\cdot2^5\)
When adding or subtracting, you make the exponents equal by shifting the decimal point.
Add the mantissas together, and round the result to 23 bits.
Normalize the result and adjust the exponent accordingly.
Check for overflow/underflow. If this occurs, make note of it. If you’re a computer, you might want to throw an error, and if you’re a student taking a test, be wary of the question and possible answers.
Set the sign bit, and put it all together in the format:
\(SIGN\ BIT_1\mid EXPONENT\ BITS_8\mid {MANTISSA}_{23}\)
Overflow and Underflow
If the exponents are too high or low, overflow or underflow may occur.
Overflow will happen if the exponent is larger than +127
Underflow will happen if the exponent is lower than -126
For subtraction, you just rewrite one of the numbers in two’s complement (flip the bits, then add one), and add like usual.
Multiplication and Division¶
To multiply two numbers in IEEE-754 format, we first un-bias the two exponents, then add them (subtract them if you’re dividing), and then re-bias the result.
Then, multiply (or divide) the mantissa. Yup, it stinks. You just gotta do it.
Normalize the result, adjusting the exponent as needed
Round the mantissa (Add the digit after the desired precision point to the digit before it, and chop off the rest)
Set the sign bit appropriately (positive if the numbers are the same sign, negative if different).
Put the bits in IEEE-754 format.