• Welcome to the new Internet Infidels Discussion Board, formerly Talk Freethought.

Computer Digital Arithmetic

steve_bank

All math operations on a digital computer reduce to binary operations.

The number 123.456 has an integer part and a fractional part. The integer part is easy to see in binary; it is a simple binary count.

Fractional decimal numbers are represented by multiples of the least significant bit, or lsb.


For any number of bits, the binary image of a decimal number is exact only when the decimal number is a multiple of the lsb. The more bits, the better the approximation.

Customarily binary numbers are shown from the msb on the left to the lsb on the right.
8 bits [7 6 5 4 3 2 1 0]
n = number of bits
B0 is the lsb

The resolution or lsb value is 1/(2^n)

For 8 bits the lsb = 1/2^8 = 0.0039062

The fractional value is
lsb*2^7 + lsb*2^6 + lsb*2^5 + lsb*2^4 + lsb*2^3 + lsb*2^2 + lsb*2^1 + lsb*2^0

The sum of the weighted bits will be 1 - lsb. Adding one more lsb rolls over to an integer 1.

This is essentially digitizing the number 1. Languages have built-in decimal to binary and binary to decimal functions.

Scilab does not have bit operations like C has, so this is a roundabout solution.
 
First is binary to fractional decimal.

function [val] = _bin2dec(x)
    // convert an msb-first bit vector x to a fractional decimal value
    // bw = binary weights
    nbits = length(x)
    lsb = 1/(2^nbits)
    val = 0
    j = nbits
    k = 0
    for i = 1:nbits
        bw(i) = lsb*(2^k)
        val = val + (bw(i) * x(j))
        j = j - 1
        k = k + 1
    end
endfunction


a = [0 0 0 0 0 0 0 1]
decval = _bin2dec(a)
The result is the lsb.

a = [1 1 0 0 0 0 0 0]
decval = _bin2dec(a)
mprintf("dec value %f\n",decval)
Result is 0.75
 
Next is going from a decimal value to a digitized approximation. The starting value is itself an approximation within Scilab, but at 64 bits.


On the Scilab command line
f = 0.8888
When I ask for the variable back I get
0.8888000000000000115463

The next routine goes from decimal to binary digitization. It is a simple brute-force method, not efficient, and it takes longer as n goes up. It sums lsb steps until the summation exceeds the target value and then returns one count back.

function [x] = _dec2bin(decval,n)
    lsb = 1/2^n
    ylast = 0
    for i = 1:2^n
        ynow = i * lsb
        if(i > 1) ylast = (i - 1) * lsb; end;
        if(ynow > decval); break; end;
    end
    if(decval == 0) then
        x = 0
    elseif (i > 1)
        x = ylast
    else
        x = lsb
    end
endfunction
nb = 8
num = .8888
dval = _dec2bin(num,nb)
relerror = ((num - dval)/num) * 100 // percent relative error vs the 64-bit Scilab value
mprintf("nb %d err %f num %f val %1.25f \n",nb,relerror,num,dval)

num is the target value
dval is the digital approximation for the given number of bits
nb is the number of bits
err is the percent relative error
nb 8 err 0.234164 num 0.888800 val 0.8867187500000000000000000
nb 16 err 0.000681 num 0.888800 val 0.8887939453125000000000000
nb 25 err 0.000001 num 0.888800 val 0.8887999951839447021484375

With an interpreter like Scilab it takes too long to go up to 64 bits.
 
I'll mention various workarounds for the problem of rounding errors.

(1) IBM mainframes had 'packed decimal' instructions which, I think, allowed arithmetic on decimal numbers of up to 31 digits! If dollar amounts are represented to the nearest tenth of a penny, only 17 digits are needed to depict the U.S. National Debt! So that format can withstand some bouts of hyperinflation before 'packed decimal' fails. (IIRC some computer instruction sets imposed no upper limit on the length of packed decimal numbers!)

(2) Of course, binary representation is good enough if we insist that all results are multiples of a mil or whatever. One reason IBM used packed decimal is that much software in earlier times did only trivial amounts of arithmetic: it would have wasted time to convert to/from binary. (Another reason might have been the COBOL philosophy: Let non-tech people see exactly what's happening.)

(3) "Unlimited" precision is an option. Even GCC supports this; and I suppose it's trivial to use in languages like C++.

(4) Rational-number arithmetic is an approach that often allows exact answers without resorting to unlimited precision. (If anyone needs source code to find exact rational-number solutions to minimax games, PM me.)

(5) Finally, the Kahan summation algorithm is a good "trick" to know when summing long series. The coding is an interesting example where compiler optimization must be disabled.
 
Binary addition works like base 10 addition, column by column.

In binary addition the sum bit is the XOR of the inputs and the carry bit is the AND; the addition is done in the processor.

In a high level language you might write

int a, b, c;

a = 2;
b = 3;
c = a + b;

Processors have general-purpose registers and a general result location called an accumulator; a move
instruction transfers data between processor registers and memory.

The compiler might generate assembly language code like this

MOV A -> REG1
MOV B -> REG2
ADDC REG1,REG2 ; ADD WITH CARRY
MOV ACC -> C

where A,B,C are physical memory locations

The assembler generates machine code, a sequential binary list.

With an add-with-carry you can chain additions to any length. On a 64-bit machine you can add a 128-bit integer: load and add the lower 64 bits, then load and add the high 64 bits with the carry.

function [x,carry] = _unsign_add(a,b)
    //unsigned binary addition
    //1 + 0 = 1, 0 + 1 = 1, 1 + 1 = 0 carry 1
    //carry out = 1 indicates overflow
    nb = length(a)
    carry = 0
    for i = 1:nb
        if(a(nb+1-i) == 0 && b(nb+1-i) == 0) then
            x(nb+1-i) = 0 + carry
            carry = 0
        elseif(a(nb+1-i)==1 && b(nb+1-i)==0) || (a(nb+1-i)==0 && b(nb+1-i)==1) then
            x(nb+1-i) = 1 + carry
            carry = 0
            if(x(nb+1-i) == 2) x(nb+1-i) = 0; carry = 1; end;
        else
            x(nb+1-i) = 0 + carry
            carry = 1
        end
    end
endfunction

a = [0 0 0 0 0 0 1 1]
b = [0 0 0 0 0 0 0 1]

[x,c] = _unsign_add(a,b)
 
Thanks for contributing.
 
Subtraction is done with 'twos complement'. The basic idea can be demonstrated in base 10.

The 10s complement of 3 is 10 - 3 = 7.
7 - 3 becomes 7 + 7 = 14; drop the carry and the result is 4.
The subtraction has been changed to an addition.

In hardware a digital subtractor is more complicated than an adder; 2s complement is straightforward.
Additionally, up-down counters are used in processors. If an 8-bit counter at 00000000 is decremented it rolls over to 11111111, which is -1 in 2s complement. 2s complement counts up from the lowest negative number to 0 and from 0 to the highest positive number.

For a 4-bit signed number with b3 the sign bit, the range is

0111 7
0110 6
0101 5
0100 4
0011 3
0010 2
0001 1
0000 0
1111 -1
1110 -2
1101 -3
1100 -4
1011 -5
1010 -6
1001 -7
1000 -8

2s complement: invert the bits and add 1.

010 decimal 2
101 invert
110 add 1, 2s complement


The 2s complement of 2, 110, is -2, consistent with the table.

Subtracting 6 - 2
0110 6
0010 2
1101 invert 2
1110 add 1, the 2s complement of 2
0110 + 1110 = 1 0100, drop the carry
0100 result, +4



Adding a negative number

1 - 2 = -1
0001 1
0010 2
1110 invert 2, add 1, 2s complement
0001 + 1110 = 1111
1111 result = -1 in 2s complement

clear
function [x,carry] = _unsign_add(a,b)
    //unsigned binary addition
    //1 + 0 = 1, 0 + 1 = 1, 1 + 1 = 0 carry 1
    //carry out = 1 indicates overflow
    nb = length(a)
    carry = 0
    for i = 1:nb
        if(a(nb+1-i) == 0 && b(nb+1-i) == 0) then
            x(nb+1-i) = 0 + carry
            carry = 0
        elseif(a(nb+1-i)==1 && b(nb+1-i)==0) || (a(nb+1-i)==0 && b(nb+1-i)==1) then
            x(nb+1-i) = 1 + carry
            carry = 0
            if(x(nb+1-i) == 2) x(nb+1-i) = 0; carry = 1; end;
        else
            x(nb+1-i) = 0 + carry
            carry = 1
        end
    end
endfunction

function [x] = _2s_comp(y,n)
    for i = 1:n // invert, 1s complement
        if(y(i) == 0) then y(i) = 1;
        else y(i) = 0
        end
    end
    b = zeros(n,1)
    b(n) = 1; // add 1, 2s complement
    [x,c] = _unsign_add(y,b)
endfunction

a = [0 1 1 0] // a - b
b = [0 0 1 0]
n = length(a)
bc = _2s_comp(b,n)
[s,c] = _unsign_add(a,bc)
sc = _2s_comp(s,n)
disp(s')
disp(sc')
 
Do any machines still use one's-complement? The old CDC supercomputers like the 6600 did.

One's complement corresponds to 9's-complement in decimal; the 9's-complement of 666 is 333 rather than 334. In binary, where 0001, 0010, 0011 represent 1, 2, 3, the negative numbers -1, -2, -3 are represented as 1110, 1101, 1100. As you see, the additive inverse in one's-complement is just the same as the bit-wise boolean inverse.

Various trade-offs arise when comparing 1's- and 2's-complement. The latter gives some negative numbers which cannot be inverted! (+127 is the largest signed character, but -128 exists. Similarly, the code short j = -32768, k = -j; will leave -32768 in BOTH j and k.)

On the other hand, with 1's-complement, an end-around carry is needed. +31 (011111) and -15 (110000) sum to +16 (010000) but when you do the bit-by-bit addition you get 001111 with a carry-out. That carry-out has to loop back conceptually: 001111 + 1 = 010000 (+16).

In one's-complement, the all-one's pattern (111111) is called 'negative zero' and usually behaves like the normal zero arithmetically. The CDC machines used 'complement-recomplement' arithmetic. This arranged that negative zero would never arise from any arithmetic except for the specific case of adding negative zero to negative zero — that produced negative zero. (To get a negative zero to start with you could use the CDC's MX instruction.)

This zero anomaly led to a bug in the library save/restore registers routines! SX6 B1 was the first step in saving B1. (This is shorthand for SX6 B1+B0 where B0 is a pseudo-register hard-wired to be all zeros.) Some programmers (including me!) were deliberately using negative-zeros, e.g. for flag optimizations, and the save/restore routines were turning our negative zeroes into positive zeroes. The fix was to replace SX6 B1 with SX6 B1-B0.)


Wow! It's been over 50 years since I coded for a CDC machine, yet I still remember the Save/Restore code! ... And I can't remember what I had for breakfast. :confused:
 
Here's another weird reminiscence; this time only 45 years old.

The IBM S360 and S370 had two register-register add instructions: AR and ALR. The results of the two instructions were EXACTLY the same, EXCEPT for setting of the condition code bits (and the optional overflow interrupt). (ALR was used to support the carrying needed in wide arithmetic.) But in a majority of cases, the programmer knows overflow won't occur and doesn't care about the condition code. Should he use AR or ALR?

It turns out that if you're anal-retentive about maximizing speed, and your code will run on the Model 145, you might want to use ALR. The implementing microcode was one microinstruction shorter.

I ended up knowing quite a bit about the 145 with its elegant architecture. I think I've already mentioned using a 100-picoFarad capacitor to "fix" a 145 that belonged to the Coca-Cola Company in Atlanta.
 
The number formats are fixed point and floating point. The arithmetic binary operations are the same; the difference is that in floating point the decimal point can move.

In applications where you know the bounds of the numbers and the accuracy beforehand, fixed point runs faster and takes less space.

Consider an 8-bit number with 1 sign bit, 2 integer bits, and 5 fractional bits. In a sensing application you may only need a small integer range; the integer part in decimal is 0, 1, 2, or 3. The fractional part runs from the lsb to 1 - lsb.

The fixed-point number above numerically evaluates as

sign bit, then 2^1 + 2^0 + lsb*2^4 + lsb*2^3 + lsb*2^2 + lsb*2^1 + lsb*2^0

Binary operations like addition and subtraction operate on the byte; the difference is how the bits are interpreted. Same with floating point. All math reduces to the same basic set of bit operations at the processor level.

a = [0 0 0 1 0 0 0 0] = 0.5
b = [0 0 0 1 0 0 0 0] = 0.5
a + b = [0 0 1 0 0 0 0 0] = 1.0

Precision and number size can be increased to the limits of memory. The early processors were limited to 8 bits.

1 sign bit, 7 integer bits, 8 fractional bits.
aint = [0 0 0 0 0 0 0 0]
afrac = [0 0 0 0 0 0 0 0]
bint = [0 0 0 0 0 0 0 0]
bfrac = [0 0 0 0 0 0 0 0]


clear
///////////////////////////////////////
function [x,carry] = _unsign_add(a,b)
    //unsigned binary addition
    //1 + 0 = 1, 0 + 1 = 1, 1 + 1 = 0 carry 1
    //carry out = 1 indicates overflow
    nb = length(a)
    carry = 0
    for i = 1:nb
        if(a(nb+1-i) == 0 && b(nb+1-i) == 0) then
            x(nb+1-i) = 0 + carry
            carry = 0
        elseif(a(nb+1-i)==1 && b(nb+1-i)==0) || (a(nb+1-i)==0 && b(nb+1-i)==1) then
            x(nb+1-i) = 1 + carry
            carry = 0
            if(x(nb+1-i) == 2) x(nb+1-i) = 0; carry = 1; end;
        else
            x(nb+1-i) = 0 + carry
            carry = 1
        end
    end
endfunction
////////////////////////////////////
function [x] = _2s_comp(y,n)
    for i = 1:n // invert, 1s complement
        if(y(i) == 0) then y(i) = 1;
        else y(i) = 0
        end
    end
    b = zeros(n,1)
    b(n) = 1; // add 1, 2s complement
    [x,c] = _unsign_add(y,b)
endfunction
///////////////////////////////////////
function [val] = _fixpt2dec(x,_intbits,_fracbits)
    // eval int and frac bits and sum to decimal
    nbits = length(x)
    flsb = 1/(2^_fracbits)
    val = 0
    k = _fracbits - 1
    for i = (nbits - _fracbits + 1):nbits
        val = val + (flsb * (2^k) * x(i))
        k = k - 1
    end
    k = _intbits - 1
    for i = 2:(_intbits + 1)
        val = val + (x(i) * 2^k)
        k = k - 1
    end
    if(x(1) == 1) val = val * -1; end;
endfunction
///////////////////////////////////////

format(10)
// 2 byte fixed point: 1 sign bit, 7 integer bits, 8 fractional bits
aint = [0 0 0 0 0 0 0 1]
afrac = [1 0 0 0 0 0 0 0]
bint = [0 0 0 0 0 0 0 0]
bfrac = [0 0 0 0 0 0 0 0]

fbits = 5
ibits = 2
// single byte 8-bit fixed point: 2-bit integer, 5-bit fractional
// sign bit, then 2^1 + 2^0 + 2^4*lsb + 2^3*lsb + 2^2*lsb + 2^1*lsb + 2^0*lsb
a = [0 0 1 1 1 0 0 0]
b = [0 0 0 1 0 0 0 0]

nb = length(a)

// add a + b
[c,car] = _unsign_add(a,b)
decval = _fixpt2dec(c,ibits,fbits)
mprintf('a + b: %f\n',decval)

// subtract a - b
b2c = _2s_comp(b,nb)
[d,car] = _unsign_add(a,b2c)
vs = _fixpt2dec(d,ibits,fbits)
mprintf('a - b: %f\n',vs)
 
Over the decades, these numbers of bits have been used to represent integers in binary fashion:

4, 6, 8, 9, 11, 12, 15, 16, 18, 24, 25, 26, 32, 36, 40, 48, 50, 60, 64, 72, 75, 100

But nearly all chip CPU's use powers of two:

4, 8, 16, 32, 64

They also almost universally use twos-complement signed integers.

One can implement numbers outside the range supported in one's hardware by using software. Some integer-only CPU chips implement floating point in this way, with software libraries. (See: Floating-point unit.)
 
On the subject of unusual representations of numbers:

Sometimes Ten/Eleven/Eleven bit arithmetic is used to convert quickly from one color space to another on 32-bit machines:
/* Convert an (8,8,8) RGB value to (10,11,11) LUV value using pre-computed tables. */
int R2yuv[256], G2yuv[256], B2yuv[256];
int Luv = R2yuv[red] + G2yuv[green] + B2yuv[blue];

On this topic, what does the following code snippet do?
Code:
#define LSBs    (1 | 1 << 11 | 1 << 22)
int foo(int x, int cnt)
{
        int signs = x & LSBs >> 1;
        return ((x & ~((LSBs << cnt) - LSBs))
                        | (signs << cnt + 1) - signs)
                >> cnt;
}
 

LSBs is a bit mask; I assume sign bits.
The sign bits are extracted.
The sign bits are set to zero.
x is shifted and the sign bits are put back in.

Don't know why the addition and subtraction. Don't know the bit formats.
 