NOTE

This section is under revision (and is optional)

REMAP CSR

(Note: both the REMAP and SHAPE sections are best read after the rest of the document has been read)

There is one 32-bit CSR which may be used to indicate which registers, if used in any operation, must be "reshaped" (re-mapped) from a linear form to a 2D or 3D transposed form, or "offset" to permit arbitrary access to elements within a register.

The 32-bit REMAP CSR may reshape up to 3 registers:

29..28  27..26  25..24  23  22..16   15  14..8    7   6..0
shape2  shape1  shape0  0   regidx2  0   regidx1  0   regidx0

regidx0-2 refer not to the Register CSR CAM entry but to the underlying real register (see regidx, the value), and are consequently 7 bits wide. Since reshaping x0 is pointless, a regidx of zero (referring to x0) is used to indicate "disabled". shape0-2 each select one of the three SHAPE CSRs; a value of 0x3 is reserved. Bits 7, 15, 23, 30 and 31 are also reserved, and must be set to zero.

It is anticipated that these specialist CSRs will not be used very often. Unlike the CSR Register and Predication tables, the REMAP CSRs use the full 7-bit regidx so that they can be set once and left alone, whilst the CSR Register entries pointing to them are disabled instead.
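The field layout above can be illustrated with a minimal decoding sketch. The function name decode_remap is hypothetical, and raising an error on reserved values is an assumption made for illustration only: this section says the reserved bits must be zero but does not say how an implementation reports a violation.

```python
# Hypothetical helper: bit positions follow the REMAP layout table above.
def decode_remap(csr):
    """Split a 32-bit REMAP value into a list of (regidx, shape) pairs."""
    # Bits 7, 15, 23, 30 and 31 are reserved and must be zero.
    reserved_mask = (1 << 7) | (1 << 15) | (1 << 23) | (0b11 << 30)
    if csr & reserved_mask:
        raise ValueError("reserved REMAP bits must be zero")  # assumed behaviour
    entries = []
    for n in range(3):
        regidx = (csr >> (8 * n)) & 0x7F      # bits 6..0, 14..8, 22..16
        shape = (csr >> (24 + 2 * n)) & 0x3   # bits 25..24, 27..26, 29..28
        if shape == 0x3:
            raise ValueError("shape value 0x3 is reserved")   # assumed behaviour
        if regidx != 0:                       # regidx 0 (x0) means "disabled"
            entries.append((regidx, shape))
    return entries


# regidx0 = 5 using SHAPE1, regidx1 disabled, regidx2 = 10 using SHAPE2
example = (2 << 28) | (10 << 16) | (1 << 24) | 5
print(decode_remap(example))  # [(5, 1), (10, 2)]
```

Disabled entries are simply dropped from the result, which mirrors the "set to zero means disabled" convention described above.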

SHAPE 1D/2D/3D vector-matrix remapping CSRs


There are three "shape" CSRs, SHAPE0, SHAPE1 and SHAPE2, each 32 bits wide and all sharing the same format. When a SHAPE CSR is set entirely to zeros, remapping is disabled: the register's elements are a linear (1D) vector.

26..24   23       22..16  15       14..8   7        6..0
permute  offs[2]  zdimsz  offs[1]  ydimsz  offs[0]  xdimsz

offs is a 3-bit field, spread out across bits 7, 15 and 23, which is added to the element index during the loop calculation.
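Because offs is scattered across three non-adjacent bits, a short decoding sketch may help. The function name decode_shape and the dictionary keys are hypothetical. The sketch also applies two rules that this section states further on: the stored size fields are one less than the actual dimension, and permute values 6 and 7 are reserved. Raising an error for a reserved permute value is an assumption.

```python
# Hypothetical helper: bit positions follow the SHAPE layout table above.
def decode_shape(csr):
    """Decode a 32-bit SHAPE value into its fields."""
    xdimsz = csr & 0x7F            # bits 6..0
    ydimsz = (csr >> 8) & 0x7F     # bits 14..8
    zdimsz = (csr >> 16) & 0x7F    # bits 22..16
    # offs[0], offs[1], offs[2] live in bits 7, 15 and 23 respectively
    offs = (((csr >> 7) & 1)
            | (((csr >> 15) & 1) << 1)
            | (((csr >> 23) & 1) << 2))
    permute = (csr >> 24) & 0x7    # bits 26..24
    if permute > 5:
        raise ValueError("permute values 6 and 7 are reserved")  # assumed behaviour
    return {
        "dims": (xdimsz + 1, ydimsz + 1, zdimsz + 1),  # sizes are stored minus one
        "offs": offs,
        "permute": permute,
    }


# xdimsz=2, ydimsz=3, zdimsz=4, offs=0b101, permute=0b010
example = 2 | (1 << 7) | (3 << 8) | (4 << 16) | (1 << 23) | (2 << 24)
print(decode_shape(example))
# {'dims': (3, 4, 5), 'offs': 5, 'permute': 2}
```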

xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates that the array size in that dimension is 1. A value of xdimsz=2, for example, indicates that there are 3 elements in the first dimension. The format of the array is therefore as follows, where xdim, ydim and zdim stand for the raw field values xdimsz, ydimsz and zdimsz:

array[xdim+1][ydim+1][zdim+1]

However, whilst this illustrates the dimensionality, it does not take the "permute" setting into account. "permute" may take any one of six values (0-5; values 6 and 7 are reserved and not legal). The table below shows how the permutation dimensionality order works:

permute  order  array format
000      0,1,2  (xdim+1)(ydim+1)(zdim+1)
001      0,2,1  (xdim+1)(zdim+1)(ydim+1)
010      1,0,2  (ydim+1)(xdim+1)(zdim+1)
011      1,2,0  (ydim+1)(zdim+1)(xdim+1)
100      2,0,1  (zdim+1)(xdim+1)(ydim+1)
101      2,1,0  (zdim+1)(ydim+1)(xdim+1)

In other words, the "permute" option changes the order in which nested for-loops over the array would be done. The algorithm below shows this more clearly, and may be executed as a python program:

# mapidx = REMAP.shape2
xdim = 3  # SHAPE[mapidx].xdimsz + 1
ydim = 4  # SHAPE[mapidx].ydimsz + 1
zdim = 5  # SHAPE[mapidx].zdimsz + 1

lims = [xdim, ydim, zdim]
idxs = [0, 0, 0]   # starting indices
order = [1, 0, 2]  # experiment with different permutations, here
offs = 0           # experiment with different offsets, here

for idx in range(xdim * ydim * zdim):
    # linearise the current (x, y, z) position into an element index
    new_idx = offs + idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
    print(new_idx, end=" ")
    # step the indices in permuted order, carrying on wrap-around
    for i in range(3):
        idxs[order[i]] = idxs[order[i]] + 1
        if idxs[order[i]] != lims[order[i]]:
            break
        print()  # a dimension wrapped: start a new output line
        idxs[order[i]] = 0

This algorithm is assumed to run within all pseudo-code throughout this document wherever a (parallelism) for-loop would normally run from 0 to VL-1 to refer to contiguous register elements: where REMAP indicates to do so, the element index is instead run through the above algorithm to work out the actual element index. Given that there are three possible SHAPE entries, up to three separate registers in any given operation may be simultaneously remapped:

function op_add(rd, rs1, rs2) # add not VADD!
  ...
  ...
  for (i = 0; i < VL; i++)
    xSTATE.srcoffs = i # save context
    if (predval & 1<<i) # predication uses intregs
      ireg[rd+remap(id)] <= ireg[rs1+remap(irs1)] +
                            ireg[rs2+remap(irs2)];
      if (!int_vec[rd].isvector) break;
    if (int_vec[rd].isvector)  { id += 1; }
    if (int_vec[rs1].isvector) { irs1 += 1; }
    if (int_vec[rs2].isvector) { irs2 += 1; }
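To make the effect of remap() in the pseudo-code more concrete, the following is a much-simplified Python sketch rather than a model of the real operation. It assumes no predication, that all three operands are vectors, one element per register, and that only rs1 is remapped. The names remap_indices and remapped_add are hypothetical; remap_indices is just the Python algorithm from earlier in this section wrapped as a generator.

```python
def remap_indices(dims, order, offs=0):
    """Yield remapped element indices (same steps as the algorithm above)."""
    xdim, ydim, zdim = dims
    idxs = [0, 0, 0]
    for _ in range(xdim * ydim * zdim):
        yield offs + idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
        for d in order:              # step indices in permuted order
            idxs[d] += 1
            if idxs[d] != dims[d]:
                break
            idxs[d] = 0              # wrapped: carry into the next dimension


def remapped_add(regs, rd, rs1, rs2, vl, remap_rs1):
    """regs[rd+i] = regs[rs1+remap(i)] + regs[rs2+i], for i in 0..vl-1."""
    for i, j in zip(range(vl), remap_rs1):
        regs[rd + i] = regs[rs1 + j] + regs[rs2 + i]


regs = [0] * 32
regs[8:14] = [10, 20, 30, 40, 50, 60]   # xdim=3, ydim=2: rows (10,20,30), (40,50,60)
# order [1,0,2] walks ydim fastest, so rs1 is read column by column
remapped_add(regs, rd=24, rs1=8, rs2=16, vl=6,
             remap_rs1=remap_indices((3, 2, 1), [1, 0, 2]))
print(regs[24:30])  # [10, 40, 20, 50, 30, 60]
```

Since registers 16 to 21 hold zeros here, the destination simply receives the source elements in remapped order.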

By changing remappings, 2D matrices may be transposed "in-place" for one operation and then accessed in a different permutation order for the next, without having to move the values in the registers to or from memory. The reason for having REMAP separate from the three SHAPE CSRs is so that in a chain of matrix multiplications and additions, for example, the SHAPE CSRs need only be set up once; only the REMAP CSR need be changed to target different registers.
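As a rough illustration of retargeting, the sketch below builds two REMAP values that select the same SHAPE CSR but point at different registers. The function name encode_remap is hypothetical, the bit positions are taken from the REMAP layout table, and the register numbers are arbitrary examples.

```python
# Hypothetical helper: the inverse of decoding the REMAP layout table.
def encode_remap(entries):
    """Pack up to three (regidx, shape) pairs into a 32-bit REMAP value."""
    if len(entries) > 3:
        raise ValueError("REMAP holds at most three entries")
    csr = 0
    for n, (regidx, shape) in enumerate(entries):
        if not 0 <= regidx <= 0x7F:
            raise ValueError("regidx must fit in 7 bits")
        if shape not in (0, 1, 2):
            raise ValueError("shape must select SHAPE0, SHAPE1 or SHAPE2")
        csr |= regidx << (8 * n)        # regidx0/1/2 at bits 6..0, 14..8, 22..16
        csr |= shape << (24 + 2 * n)    # shape0/1/2 at bits 25..24, 27..26, 29..28
    return csr


# SHAPE1 is assumed to have been set up once, elsewhere; only REMAP changes.
first = encode_remap([(8, 1)])    # remap register 8 through SHAPE1
second = encode_remap([(16, 1)])  # later: remap register 16 through SHAPE1
print(hex(first), hex(second))    # 0x1000008 0x1000010
```

Only the low regidx field differs between the two values, which is the point made above: the shape description is written once and the (cheaper) REMAP write selects which register it applies to.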

Note that:

  • Over-running the register file clearly has to be detected, and an illegal instruction exception thrown.
  • When non-default elwidths are set, the exact same algorithm still applies (i.e. it offsets elements within registers rather than entire registers).
  • If permute option 000 is utilised, the actual order of the reindexing does not change!
  • If two or more of the dimension size fields are set to zero (i.e. those dimensions have a size of 1), the actual order does not change!
  • The above algorithm is pseudo-code only. Actual implementations will need to take into account the fact that the element for-looping must be re-entrant, due to the possibility of exceptions occurring. See MSTATE CSR, which records the current element index.
  • Twin-predicated operations require two separate and distinct element offsets. The above pseudo-code algorithm will be applied separately and independently to each, should each of the two operands be remapped. This even includes C.LDSP and other operations in that category, where in that case it will be the offset that is remapped (see Compressed Stack LOAD/STORE section).
  • Offset is especially useful, on its own, for accessing elements in the middle of a register. Without offsets, it is necessary either to use a predicated MV, skipping the first elements, or to perform a LOAD/STORE cycle to memory. With offsets, the data does not have to be moved.
  • Setting the total elements (xdim+1) times (ydim+1) times (zdim+1) to less than MVL is perfectly legal, albeit very obscure. It permits entries to be regularly presented to operands more than once, thus allowing the same underlying registers to act as an accumulator of multiple vector or matrix operations, for example.

Considerable care needs to be taken here, as the remapping could hypothetically create arithmetic operations that target the exact same underlying registers, resulting in data corruption due to pipeline overlaps. Out-of-order / Superscalar micro-architectures with register-renaming will have an easier time dealing with this than DSP-style SIMD micro-architectures.
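Three of the notes in the list above (permute 000 leaves the order unchanged, two dimensions of size 1 leave the order unchanged, and offset on its own shifts the starting element) can be checked with a small sketch. The generator remap_indices is a hypothetical wrapper around the Python algorithm given earlier in this section, not part of the specification.

```python
from itertools import permutations


def remap_indices(dims, order, offs=0):
    """Yield remapped element indices (same steps as the earlier algorithm)."""
    xdim, ydim, zdim = dims
    idxs = [0, 0, 0]
    for _ in range(xdim * ydim * zdim):
        yield offs + idxs[0] + idxs[1] * xdim + idxs[2] * xdim * ydim
        for d in order:
            idxs[d] += 1
            if idxs[d] != dims[d]:
                break
            idxs[d] = 0


linear = list(range(12))

# permute 000 (order 0,1,2): the reindexing is the plain linear order
assert list(remap_indices((3, 4, 1), (0, 1, 2))) == linear

# two dimensions of size 1: every permutation gives the linear order
for dims in [(12, 1, 1), (1, 12, 1), (1, 1, 12)]:
    for order in permutations(range(3)):
        assert list(remap_indices(dims, order)) == linear

# offset on its own: start reading from element 3 of the register group
print(list(remap_indices((4, 1, 1), (0, 1, 2), offs=3)))  # [3, 4, 5, 6]
```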