Simple-V (Parallelism Extension Proposal) Vector Block Format

  • Copyright (C) 2017, 2018, 2019 Luke Kenneth Casson Leighton
  • Status: DRAFTv0.7.1
  • Last edited: 2 sep 2019

Vector Block Format

This is a way to give Vector and Predication Context to a group of standard scalar RISC-V instructions, in a highly compact form. Program Execution Order is still preserved (unlike VLIW), just with "context" that would otherwise require much longer instructions.

The format is:

  • the standard RISC-V 80 to 192 bit encoding sequence, with bits defining the options to follow within the block
  • An optional VL Block (16-bit)
  • Optional predicate entries (8/16-bit blocks: see Predicate Table, above)
  • Optional register entries (8/16-bit blocks: see Register Table, above)
  • finally some 16/32/48 bit standard RV or SVPrefix opcodes follow.

Thus, the variable-length format from Section 1.5 of the RISC-V ISA is used as follows:

base+4 ... base+2 base number of bits
..xxxx xxxxxxxxxxxxxxxx xnnnxxxxx1111111 (80+16*nnn)-bit, nnn!=111
{ops}{Pred}{Reg}{VL Block} VBLOCK Prefix

A suitable prefix, which fits the Expanded Instruction-Length encoding for "(80 + 16 times instruction-length)", as defined in Section 1.5 of the RISC-V ISA, is as follows:

15 14:12 11:10 9 8 7 6:0
vlset 16xil rplen pplen pmode rmode 1111111

The VL/MAXVL/SubVL Block format, when 16xil != 0b111, is:

31:30 29:28 27:22 21 20:19 18:16 comment..................
0b00 SubVL imm[5:0] rsvd rd[4:0] sv.setvl rd, x0, imm
0b01 SubVL imm[5:0] rs1[2:0] rd[2:0] sv.setvl rd, rs1, imm (1)
0b10 SubVL imm[5:0] rsvd rs1[4:0] sv.setvl x0, rs1, imm
0b11 rsvd rsvd rsvd rsvd reserved, all 0s

Note (1) - Registers are in RVC format (x8-x15)

Note (2) - sv.setvl behaviour is expected, as if an sv.setvl instruction had actually been called.

When 16xil is 0b111, this is the "Extended" Format, using the >= 192-bit RISC-V ISA format. Note that the length is 96+16*nnnnn, not 192+

base+5 ... base+3 base+1 base no. of bits
..xxxx xxxxxxxxxxxxxxxx x111xxxxx1111111 96+16*nnnnn
{ops}{Pred}{Reg}{VL Block} VBLOCK2 VBLOCK Prefix

VBLOCK2 extends the VBLOCK fields:

15 14:12 11:10 9:8 7:5 4:0
rsvd mapsz rplen2 pplen2 swlen ilen
  • ilen is the instruction length (number of 16-bit blocks)
  • swlen specifies the number of "swizzle" blocks
  • rplen2 extends rplen by 2 bits
  • pplen2 extends pplen by 2 bits
  • mapsz indicates the size of the "remap" area. See table below for size
  • 1 bit is reserved for extensions

Mapsz to Remap size is in number of 16-bit blocks:

mapsz remap size
0 0
1 6
2 7
3 8
4 10
5 12
6 14
7 16

Note: The VL Block format is similar to that used in sv prefix proposal.

  • Mode 0b00: set VL to the immediate, truncated to not exceed MVL. Register rd is also set to the same value, if not x0.
  • Mode 0b01: follow sv.setvl rules, with RVC style registers in the range x8-x15 for rs1 and rd.
  • Mode 0b10: set both MVL and VL to the immediate. Register rd is also set if not x0.
  • Mode 0b11: reserved. All fields must be zero.

Mode 0b01 will typically be used to start vectorised loops, where the VBLOCK instruction effectively embeds an optional "SETSUBVL, SETVL" sequence (in compact form).

Modes 0b00 and 0b10 will typically not be used so much for loops as they will be for one-off instructions such as saving the entire register file to the stack with a single one-off Vectorised and predicated LD/ST, or as a way to save or restore registers in a function call with a single instruction.

Unlike in RVV, VL is set (within the limits of MVL) to exactly the value requested, specifically so that LD/ST-MULTI style behaviour can be done in a single instruction.

VBLOCK Prefix

The purpose of the VBLOCK Prefix is to specify the context in which a block of RV Scalar instructions are "vectorised" and/or predicated.

As there are not very many bits available without going into a prefix format longer than 16 bits, some abbreviations are used. Two bits are dedicated to specifying whether the Register and Predicate formats are 16 or 8 bit.

Also, the number of entries in each table is specified with an unusual encoding, on the basis that if registers are to be Vectorised, it is highly likely that they will be predicated as well.

The VL Block is optional and also only 16 bits: this because an RVC opcode is limited by comparison.

The format is explained as follows:

  • Bit 7 specifies if the register prefix block format is the full 16 bit format (1) or the compact less expressive format (0).
  • 8 bit format predicate numbering is implicit and begins from x9. Thus it is critical to put blocks in the correct order as required.
  • Bit 8 specifies if the predicate block format is 16 bit (1) or 8 bit (0).
  • Bit 15 specifies if the VL Block is present. If set to 1, the VL Block immediately follows the VBLOCK instruction Prefix
  • Bits 10 and 11 define how many RegCam entries (0,1,2,4 if bit 7 is 1, otherwise 0,2,4,8) follow the (optional) VL Block.
  • Bit 9 define how many PredCam entries follow the (optional) RegCam block. If pplen is 1, it is equal to rplen. Otherwise, half rplen, rounded up.
  • If the exact number of entries are not required, PredCam and RegCam entries may be set to all zero to indicate "unused" (no effect).
  • Bits 14 to 12 (IL) define the actual length of the instruction: total number of bits is 80 + 16 times IL. Standard RV32, RVC and also SVPrefix (P32C/P48/64-*-Type) instructions fit into this space, after the (optional) VL / RegCam / PredCam entries
  • In any RVC or 32 Bit opcode, any registers within the VBLOCK-prefixed format MUST have the RegCam and PredCam entries applied to the operation (and the Vectorisation loop activated)
  • P48 and P64 opcodes do not take their Register or predication context from the VBLOCK tables: they do however have VL or SUBVL applied (overridden when VLtyp or svlen are set).
  • At the end of the VBLOCK Group, the RegCam and PredCam entries no longer apply. VL, MAXVL and SUBVL on the other hand remain at the values set by the last instruction (whether a CSRRW or the VL Block header).
  • Although an inefficient use of resources, it is fine to set the MAXVL, VL and SUBVL CSRs with standard CSRRW instructions, within a VBLOCK.

All this would greatly reduce the amount of space utilised by Vectorised instructions, given that 64-bit CSRRW requires 3, even 4 32-bit opcodes: the CSR itself, a LI, and the setting up of the value into the RS register of the CSR, which, again, requires a LI / LUI to get the 32 bit data into the CSR. To get 64-bit data into the register in order to put it into the CSR(s), LOAD operations from memory are needed!

Given that each 64-bit CSR can hold only 4x PredCAM entries (or 4 RegCAM entries), that's potentially 6 to eight 32-bit instructions, just to establish the Vector State!

Not only that: even CSRRW on VL and MAXVL requires 64-bits (even more bits if VL needs to be set to greater than 32). Bear in mind that in SV, both MAXVL and VL need to be set.

By contrast, the VBLOCK prefix is only 16 bits, the VL/MAX/SubVL block is only 16 bits, and as long as not too many predicates and register vector qualifiers are specified, several 32-bit and 16-bit opcodes can fit into the format. If the full flexibility of the 16 bit block formats are not needed, more space is saved by using the 8 bit formats.

In this light, embedding the VL/MAXVL, PredCam and RegCam CSR entries into a VBLOCK format makes a lot of sense.

Bear in mind the warning in an earlier section that use of VLtyp or svlen in a P48 or P64 opcode within a VBLOCK Group will result in corruption (use) of the STATE CSR, as the STATE CSR is shared with SVPrefix. To avoid this situation, the STATE CSR may be copied into a temp register and restored afterwards.

Register Table Format

The register table format is covered in the main specification, included here for convenience:

16 bit format:

RegCAM 15 (14..8) 7 (6..5) (4..0)
0 isvec0 regidx0 i/f vew0 regkey0
1 isvec1 regidx1 i/f vew1 regkey1
2 isvec2 regidx2 i/f vew2 regkey2
3 isvec3 regidx3 i/f vew3 regkey3

8 bit format:

RegCAM 7 (6..5) (4..0)
0 i/f vew0 regnum

Showing the mapping (relationship) between 8-bit and 16-bit format:

RegCAM 15 (14..8) 7 (6..5) (4..0)
0 isvec=1 regnum0<<2 i/f vew0 regnum0
1 isvec=1 regnum1<<2 i/f vew1 regnum1
2 isvec=1 regnum2<<2 i/f vew2 regnum2
3 isvec=1 regnum3<<2 i/f vew3 regnum3

Predicate Table Format

The predicate table format is covered in the main specification, included here for convenience:

16 bit format:

PrCSR (15..11) 10 9 8 (7..1) 0
0 predidx zero0 inv0 i/f regidx ffirst0
1 predidx zero1 inv1 i/f regidx ffirst1
2 predidx zero2 inv2 i/f regidx ffirst2
3 predidx zero3 inv3 i/f regidx ffirst3

Note: predidx=x0, zero=1, inv=1 is a RESERVED encoding. Its use must generate an illegal instruction trap.

8 bit format:

PrCSR 7 6 5 (4..0)
0 zero0 inv0 i/f regnum

Mapping from 8 to 16 bit format, the table becomes:

PrCSR (15..11) 10 9 8 (7..1) 0
0 x9 zero0 inv0 i/f regnum ff=0
1 x10 zero1 inv1 i/f regnum ff=0
2 x11 zero2 inv2 i/f regnum ff=0
3 x12 zero3 inv3 i/f regnum ff=0

Swizzle Table Format

The swizzle table format is included here for convenience:

Swizzle format:

7:6 5:4 3:2 1:0
w z y x

Unsigned consts:

(1..0) type
0 0x00000
1 LSB Hi (0x00..001)
2 MSB Hi (0x10..000)
3 0xfff...ff

Signed consts:

(1..0) type
0 0x00000
1 LSB Hi (0x00..001)
2 MSB Hi (0x10..000)
3 0x7ff...ff

FP consts:

(1..0) type
0 0.0
1 1.0
2 0.5
3 pi

Type:

(2..0) type
0 xyzw
1 consts
2-7 rsvd

16 bit format:

SwzCAM (15..13) (12..8) (7..0)
0 type0 regidx0 swiz0
1 type1 regidx1 swiz1
2 type2 regidx2 swiz2
3 type3 regidx3 swiz3

Swizzle blocks are only accessible using the "VBLOCK2" format.

The swizzles activate on SUBVL and only when used in an operation where a register matches with a SwizzleCAM register entry.

On a match the register element index will be redirected through the swizzle format. If however the type is set to "constants" then instead of reading the register file the relevant constant is substituted instead.

Setting const type on a destination element will cause an illegal instruction.

REMAP Area Format

REMAP is an algorithmic version of in-place vector "vgather" or "swizzle".

The REMAP area is divided into two areas:

  • Register-to-SHAPE. This defines which registers have which shapes. Each entry is 8-bits in length.
  • SHAPE Table entries. These are 32-bits in length and are aligned to (start on) a 16 bit boundary.

REMAP Table Entries:

7:5 4:0
shapeidx regnum

When both shapeidx and regnum are zero, this indicates the end of the REMAP Register-to-SHAPE section. The REMAP Table section size is then aligned to a 16-bit boundary. 32-bit SHAPE Table Entries then fill the remainder of the REMAP area, and are indexed in order by shapeidx.

In this way, multiple registers may share the same "shape" characteristics.

SHAPE Table Format

The shape table format is included here for convenience. See remap for full details on how SHAPE applies, including pseudo-code.

Shape is 32-bits. When SHAPE is set entirely to zeros, remapping is disabled: the register's elements are a linear (1D) vector.

31..30 29..24 23..21 20..18 17..12 11..6 5..0
applydim offset invxyz permute zdimsz ydimsz xdimsz

applydim will set to zero the dimensions less than this. applydim=0 applies all three. applydim=1 applies y and z. applydim=2 applys only z. applydim=3 is reserved.

invxyz will invert the start index of each of x, y or z. If invxyz[0] is zero then x-dimensional counting begins from 0 and increments, otherwise it begins from xdimsz-1 and iterates down to zero. Likewise for y and z.

offset will have the effect equivalent to the sequential element loop to appear to run for offset (additional) iterations prior to actually generating output.

xdimsz, ydimsz and zdimsz are offset by 1, such that a value of 0 indicates that the array dimensionality for that dimension is 1. A value of xdimsz=2 would indicate that in the first dimension there are 3 elements in the array. The format of the array is therefore as follows:

array[xdim+1][ydim+1][zdim+1]

However whilst illustrative of the dimensionality, that does not take the "permute" setting into account. "permute" may be any one of six values (0-5, with values of 6 and 7 being reserved, and not legal). The table below shows how the permutation dimensionality order works:

permute order array format
000 0,1,2 (xdim+1)(ydim+1)(zdim+1)
001 0,2,1 (xdim+1)(zdim+1)(ydim+1)
010 1,0,2 (ydim+1)(xdim+1)(zdim+1)
011 1,2,0 (ydim+1)(zdim+1)(xdim+1)
100 2,0,1 (zdim+1)(xdim+1)(ydim+1)
101 2,1,0 (zdim+1)(ydim+1)(xdim+1)

In other words, the "permute" option changes the order in which nested for-loops over the array would be done.

REMAP Shape blocks are only accessible using the "VBLOCK2" format.

CSRs:

The CSRs needed, in addition to those from the main specification are:

  • pcvblk
  • mepcvblk
  • sepcvblk
  • uepcvblk
  • hepcvblk

To greatly simplify implementations, which would otherwise require a way to track (cache) VBLOCK instructions, it is required to treat the VBLOCK group as a separate sub-program with its own separate PC. The sub-pc advances separately whilst the main PC remains "frozen", pointing at the beginning of the VBLOCK instruction (not to be confused with how VL works, which is exactly the same principle, except it is VStart in the STATE CSR that increments).

This has implications, namely that a new set of CSRs identical to (x)epc (mepc, srpc, hepc and uepc) must be created and managed and respected as being a sub extension of the (x)epc set of CSRs. Thus, (x)epcvblk CSRs must be context switched and saved / restored in traps.

The srcoffs and destoffs indices in the STATE CSR may be similarly regarded as another sub-execution context, giving in effect two sets of nested sub-levels of the RISCV Program Counter (actually, three including SUBVL and ssvoffs).

PCVBLK CSR Format

Using PCVBLK to store the progression of decoding and subsequent execution of opcodes in a VBLOCK allows a simple single issue design to only need to fetch 32 or 64 bits from the instruction cache on any given clock cycle.

(This approach also alleviates one of the main concerns with the VBLOCK Format: unlike a VLIW engine, a FSM no longer requires full buffering of the entire VBLOCK opcode in order to begin execution. Future versions may therefore potentially lift the 192 bit limit).

To support this option (where more complex implementations may skip some of these phases), VBLOCK contains partial decode state, that allows a trap to occur even part-way through decode, in order to reduce latency.

The format is as follows:

31:30 29 28:26 25:24 23:22 21 20:5 4:0
status vlset 16xil pplen rplen mode vblock2 opptr
2 1 3 2 2 1 16 5
  • status is the key field that effectively exposes the inner FSM (Finite State Machine) directly.
  • status = 0b00 indicates that the processor is not in "VBLOCK Mode". It is instead in standard RV Scalar opcode execution mode. The processor will leave this mode only after it encounters the beginning of a valid VBLOCK opcode.
  • status=0b01 indicates that vlset, 16xil, pplen, rplen and mode have all been copied directly from the VBLOCK so that they do not need to be read again from the instruction stream, and that VBLOCK2 has also been read and stored, if 16xil was equal to 0b111.
  • status=0b10 indicates that the VL Block has been read from the instruction stream and actioned. (This means that a SETVL instruction has been created and executed). It also indicates that reading of the Predicate, Register and Swizzle Blocks are now being read.
  • status=0b11 indicates that the Predicate and Register Blocks have been read from the instruction stream (and put into internal Vector Context) Simpler implementations are permitted to reset status back to 0b10 and re-read the data after return from a trap that happened to occur in the middle of a VBLOCK. They are not however permitted to destroy opptr in the process, and after re-reading the Predicate and Register Blocks must resume execution pointed to by opptr.
  • opptr points to where instructions begin in the VBLOCK. 0 indicates the start of the opcodes (not the start of the VBLOCK), and is in multiples of 16 bits (2 bytes). This is the equivalent of a Program Counter, for VBLOCKs.
  • at the end of a VBLOCK, when the last instruction executes (assuming it does not change opptr to earlier in the block), status is reset to 0b00 to indicate exit from the VBLOCK FSM, and the current Vector Predicate and Register Context destroyed (Note: the STATE CSR is not altered purely by exit from a VBLOCK Context).

During the transition from status=0b00 to status=0b01, it is assumed that the instruction stream is being read at a mininum of 32 bits at a time. Therefore it is reasonable to expect that VBLOCK2 would be successfully read simultaneously with the initial VBLOCK header. For this reason there is no separate state in the FSM for updating of the vblock2 field in PCVBLK.

When the transition from status=0b01 to status=0b10 occurs, actioning the VL Block state actually and literally must be as if a SETVL instruction had occurred. This can result in updating of the VL and MVL CSRs (and the VL destination register target). Note, below, that this means that a context-switch may save/restore VL and MVL (and the integer register file), where the remaining tables have no such opportunity.

When status=0b10, and before status=0b11, there is no external indicator as to how far the hardware has got in the process of reading the Predicate, Register, and Swizzle Blocks. Implementations are free to use any internal means to track progress, however given that if a trap occurs the read process will need to be restarted (in simpler implementations), there is no point having external indicators of progress. By complete contrast, given that a SETVL actually writes to VL (and MVL), the VL Block state has been actioned and thus would be successfully restored by a context-switch.

When status=0b11, opptr may be written to using CSRRWI. Doing so will cause execution to jump within the block, exactly as if PC had been set in normal RISC-V execution. Writing a value outside of the range of the instruction block will cause an illegal instruction exception. Writing a value (any value) when status is not 0b11 likewise causes an illegal instruction exception. To be clear: CSRRWI PCVBLK does not have the same behaviour as CSRRW PCVBLK.

In privileged modes, obviously the above rules do not apply to the completely separate (x)ePCVBLK CSRs because these are (inactive) copies of state, not the actual active PCVBLK. Writing to PCVBLK during a trap however, clearly the rules must apply.

If PCVBLK is written to with CSRRW, the same rules apply, however the entire register in rs1 is treated as the new opptr.

Note that the value returned in the register rd is the full PCVBLK, not just the opptr part.

Limitations on instructions

As the pcvblk CSR is relative to the beginning of the VBLOCK, branch and jump opcodes MUST NOT be used to point to a location inside a block: only at the beginning of an opcode (including another VBLOCK, including the current one). However, setting the PCVBLK CSR is permitted, to unconditionally jump to any opcode within a block.

Also: calling subroutines is likewise not permitted, because PCVBLK context cannot be atomically reestablished on return from the function.

ECALL, on the other hand, which will cause a trap that saves and restores the full state, is permitted.

Prohibited instructions will cause an illegal instruction trap. If at that point, software is capable of then working out how to emulate a branch or function call successfully, by manipulating (x)ePCVBLK and other state, it is not prohibited from doing so.

To reiterate: a normal jump, normal conditional branch and a normal function call may only be taken by letting the VBLOCK group finish, returning to "normal" standard RV mode, and then using standard RVC, 32 bit or P48/64-*-type opcodes.

The exception to this rule is if the branch or jump within the VBLOCK is back to the start of the same VBLOCK. If this is used, the VBLOCK is, clearly, to be re-executed, including any optional VL blocks and any predication, register table context etc.

Given however that the tables are already established, it is only the VL block that needs to be re-run. The other tables may be left as-is.

Links

Open Questions:

  • Is it necessary to stick to the RISC-V 1.5 format? Why not go with using the 15th bit to allow 80 + 16*0bnnnn bits? Perhaps to be sane, limit to 256 bits (16 times 0-11).
  • Could a "hint" be used to set which operations are parallel and which are sequential?
  • Could a new sub-instruction opcode format be used, one that does not conform precisely to RISC-V rules, but unpacks to RISC-V opcodes? no need for byte or bit-alignment
  • Could a hardware compression algorithm be deployed? Quite likely, because of the sub-execution context (sub-VBLOCK PC)