Dynamic Partitioned Shift

This is almost as complex as multiplication, except there is a trick that can be deployed. In the partitioned multiplier, it is necessary to compute a full NxN matrix of partial multiplication results, then perform a cascade of adds, using PartitionedAdd to "automatically" break them down into segments.

Partitioned Shifting will also require to have an NxN matrix, however it is slightly different. first, define the following:

a0 = a[7..0], a1 = a[15..8], ....
b0 = b[0+4..0], b1 = b[8+4..8], b2 = b[16+4..16], b3 = b[24+4..24]

QUESTION: should b1 be limited to min(b[8+4..8], 24), b2 be similarly limited to 15, and b3 be limited/min'd to 8?

then, we compute the following matrix, with the first column output being the full width (32 bit), the second being only 24 bit, the third only 16 bit and finally the top part (comprising the most significant byte of a and b as input) being only 8 bit

| a0 << b0 | a1 << b0 | a2 << b0 | a3 << b0
|          | a1 << b1 | a2 << b1 | a3 << b1
|          |          | a2 << b2 | a3 << b2
|          |          |          | a3 << b3

Where multiply would perform a cascading-add across those partial results, shift is different in that we know (assume) that for each shift-amount (operand b), within each partition the topmost bits are zero.

This because, in the typical 64-bit shift, the operation is actually:

result[63..0] = a[63..0] << b[5..0]

NOT b[63..0], i.e. the amount to shift a 64-bit number by by is in the lower six bits of b. Likewise, for a 32-bit number, this is 5 bits.

Therefore, in principle, it should be possible to simply use Muxes on the partial-result matrix, ORing them together. Assuming (again) a 32-bit input and a 4-way partition:

out0 = p00[7..0]
out1 = pmask[0] ? p01[7..0] : p00[15..8]

Table based on partitions:

p2p1p0 o0 o1 o2 o3
++++++ ++++++++ ++++++++ ++++++++ ++
0 0 0 a0b0 a1b0 a2b0 a3b0
0 0 1 a0b0 a1b1 a2b1 a3b1
0 1 0 a0b0 a1b0 a2b2 a3b2
0 1 1 a0b0 a1b1 a2b2 a3b2
1 0 0 a0b0 a1b0 a2b0 a3b3
1 0 1 a0b0 a1b1 a2b1 a3b3
1 1 0 a0b0 a1b0 a2b2 a3b3
1 1 1 a0b0 a1b1 a2b2 a3b3

For o0 the output is simple: a0b0 for all partition permutations.

o0 = a0b0[7:0]

The output for o1 looks something like this:

p2p1p0 : o1
0 0 0  : a0b0[15:8] | a1b0[7:0]
0 0 1  | a0b0[15:8] | a1b1[7:0]
0 1 0  | a0b0[15:8] | a1b0[7:0]
0 1 1  | a0b0[15:8] | a1b1[7:0]
1 0 0  | a0b0[15:8] | a1b0[7:0]
1 0 1  | a0b0[15:8] | a1b1[7:0]
1 1 0  | a0b0[15:8] | a1b0[7:0]
1 1 1  | a0b0[15:8] | a1b1[7:0]

if True:           o1  = a0b0[15:8]
if ~p0:            o1 |= a1b0[7:0]
if  p0:            o1 |= a1b1[7:0]

For o2:

p2p1p0 : o2
0 0 0  | a0b0[23:16] | a1b0[15:8] | a2b0[7:0]
0 0 1  | a0b0[23:16] | a1b1[15:8] | a2b1[7:0]
0 1 0  | a0b0[23:16] | a1b0[15:8] | a2b2[7:0]
0 1 1  | a0b0[23:16] | a1b1[15:8] | a2b2[7:0]
1 0 0  | a0b0[23:16] | a1b0[15:8] | a2b0[7:0]
1 0 1  | a0b0[23:16] | a1b1[15:8] | a2b1[7:0]
1 1 0  | a0b0[23:16] | a1b0[15:8] | a2b2[7:0]
1 1 1  | a0b0[23:16] | a1b1[15:8] | a2b2[7:0]

therefore:

if True:          o2 =  a0b0[23:16]
if ~p0:           o2 |= a1b0[15:0]
if  p0:           o2 |= a1b1[15:0]
if ~p0&~p1:       o2 |= a2b0[7:0]
if  p0&~p1:       o2 |= a2b1[7:0]
if     ~p1:       o2 |= a2b2[7:0]

For o3:

p2p1p0 | o3
0 0 0  | a0b0[31:24] | a1b0[23:16] | a2b0[15:8] | a3b0[7:0]
0 0 1  | a0b0[31:24] | a1b1[23:16] | a2b1[15:8] | a3b1[7:0]
0 1 0  | a0b0[31:24] | a1b0[23:16] | a2b2[15:8] | a3b2[7:0]
0 1 1  | a0b0[31:24] | a1b1[23:16] | a2b2[15:8] | a3b2[7:0]
1 0 0  | a0b0[31:24] | a1b0[23:16] | a2b0[15:8] | a3b3[7:0]
1 0 1  | a0b0[31:24] | a1b1[23:16] | a2b1[15:8] | a3b3[7:0]
1 1 0  | a0b0[31:24] | a1b0[23:16] | a2b2[15:8] | a3b3[7:0]
1 1 1  | a0b0[31:24] | a1b1[23:16] | a2b2[15:8] | a3b3[7:0]

therefore:

if True:          o3 =  a0b0[31:24]
if ~p0:           o3 |= a1b0[23:16]
if  p0:           o3 |= a1b1[23:16]
if ~p0& p1:       o3 |= a2b1[15:8]
if  p0& p1:       o3 |= a2b0[15:8]
if      p1:       o3 |= a2b2[15:8]
if ~p0&~p1&~p2:   o3 |= a3b0[7:0]
if  p0&~p1&~p2:   o3 |= a3b1[7:0]
if      p1&~p2:   o3 |= a3b2[7:0]
if          p2:   o3 |= a3b3[7:0]

Code that prints the above:

def decoderange(col, k):
    xs = (col-k)*8
    return "%d:%d" % (xs+7, xs)

def decodelist(l):
    res = []
    for (idx, sgn) in l:
        sgn = " " if sgn else "~"
        res.append("%sp%d" % (sgn, idx))
    return '&'.join(res)

N = 4
M = 4

for col in range(M):
    r = decoderange(col, 0)
    print "if True: o%d = a0b0[%s]" % (col, r)
    for i in range(1,col+1):
        l = []
        for j in range(i):
            l.append([j, False])
        k = 0
        s = decodelist(l)
        r = decoderange(col, i)
        print "if %s: o%d |= a%db%d[%s]" % (s, col, i, k, r)
        k += 1
        while l:
            l[0][1] = True
            s = decodelist(l)
            r = decoderange(col, i)
            print "if %s: o%d |= a%db%d[%s]" % (s, col, i, k, r)
            k += 1
            del l[0]
    print

Note

One important part to remember is that the upper bits of the shift amount need to specifically ignored rather than assuming that they are zeros, this is because, for shift instructions: a << b actually is a << (b % bit_len(a))

Static Partitioned Shift

Static shift is pretty straightforward: the input is the entire number shifted, however clearly due to partitioning some of the bits need to be masked out (to zero, or to 1 if it is an arithmetic shift right). This can therefore use the class currently known as "MultiShiftRMerge" or similar, which can merge in a 1 into the shifted data. Luckily the amount to be 0'd/1'd is also statically computable: it is just that the location where is dynamic, based on the partition location(s).

Bugreports